Abstract

This thesis extends existing modeling, analysis and comparison of interconnection
networks for parallel processing systems. Simulation models are developed for
the multistage cube network, the single stage cube network (hypercube), and the
Illiac IV mesh-type network. They are then used to provide a comparison of three
classes of interconnection networks which, until now, has not been performed. These
models, implemented using a commercially packaged simulation language, provide
for compact source code and ease of readability. The networks are modeled under
a common set of operating assumptions and system environment. This allows for
accurate comparisons of average network packet delays and memory requirements
necessary to physically implement the chosen network at a given network operating
load. It is concluded that, for the network sizes and operating conditions established,
the multistage cube network performs better at a lower hardware cost than do the
single stage cube and mesh networks. As a result, the designer of a parallel
processing system is given additional insight for choosing an interconnection network
which best suits the application needs. This thesis investigation is summarized in
[RaD88a] and [RaD88b].
1.  Introduction


1.1  Background 

Since the advent of the modern computer, computer architects have continuously
attempted, and in most cases succeeded, in designing and implementing faster,
more powerful systems. Due to design innovations and technological advances,
computer systems have rapidly expanded and diversified from the uniprocessor, von
Neumann architecture [Von46]. In the relatively short period of forty years,
computing capabilities have increased from performing simple mathematical functions to
those of performing complex numerical and artificial intelligence applications.
Traditionally, the computational power needs of society have exceeded the processing
power of contemporary computer systems. For this reason, ongoing research is
investigating and proposing possible ways of meeting or exceeding the processing needs
of society.

Processing  tasks  such  as  weather  forecasting,  ballistic  missile  defense,  image 
processing,  air-traffic  control,  pattern  and  speech  recognition,  medical  diagnosis, 
and  robotic  vision  are  issues  that  concern  society.  These  tasks,  due  to  their  nature, 
require  the  computational  speeds  of  the  host  machine  to  be  near  or  at  real-time 
speeds. 

Due to the computational complexity of the algorithms required to implement
the above named tasks, real-time processing is, in some cases, not feasible in a
uniprocessor environment. The real-time constraints imposed by these tasks require
the computer architect to look for ways of expanding from the uniprocessor von
Neumann architecture to accomplish these tasks.

In many cases, the uniprocessor von Neumann architectures lack the ability
to perform real-time applications. This is due to their sequential mode of operation
and largely to the computational complexity of the algorithms executed. In the von
Neumann machine, instructions must be executed in a sequential manner. Design
techniques which incorporate overlapping and pipelining of instructions allow for
execution times to be greatly reduced. These techniques, coupled with the technological
advances in very large scale integrated (VLSI) circuit design, have brought single
processor computational speeds to near the speeds necessary for real-time applications.
To achieve the real-time processing capabilities, concurrent processing must take
place. Parallel processing systems, comprised of many cooperating processors, have
been developed in an effort to achieve the real-time processing capabilities.

One of the major concerns in designing parallel processing systems lies in
determining the interconnection scheme of the multiple processors. The designer must
consider the application and the number of processors to be used when choosing
an interconnection network. Numerous interconnection networks have been
implemented to link multiple processors to achieve a parallel computing environment
[Fen81]. The goal of the interconnection network design is to have the
communication time between processors be substantially less than the processing time required
by an individual processor to execute an instruction or set of instructions.

Further design issues exist once an interconnection network topology has been
chosen for study. First, the switching methodology must be determined. The
switching methodology determines the way in which a message is to be routed through the
switching elements of the network. Four methods exist: circuit switching (dedicated
physical transmission paths are set up and held until completion of transmission);
packet switching (transmission paths are dynamically allocated and released upon
the receipt and passage of the data packet); virtual cut through (packet headers
are examined and forwarded to the next appropriate channel before receipt of the
entire packet) [KeK79]; and wormhole routing, a derivative of virtual cut through
(blocked packets remain in the network instead of being buffered as in virtual cut
through) [Dal86].

The control strategy of the network switching elements is another design issue
that must be examined. Two types of control strategies exist: control which is
distributed at each switching element and a centralized control which is used for all
switching elements. Trade-offs exist for using either of the two control strategies.
Chapters 2 and 3 provide a more detailed discussion of the control strategies as well as
switching methodologies and interconnection network design and implementations.

As the number of processors in a parallel system has grown, so has the
complexity of analytical approaches to modeling these systems. In most cases, it is no longer
feasible, and in few cases possible, to determine the expected performance of a
multiprocessor design using only mathematical techniques. As a result, many designers
of multiprocessor systems have turned to modeling their systems through computer
simulations. Computer simulation can provide a low-cost and time-efficient method
for systems modeling. Simulation can provide supplemental information which can
be used to validate mathematical models when available. Through simulation, it is
possible to compare the performance of dissimilar designs and determine the design
best suited for a particular application.

1.2  Research  Goals 

While many research efforts have examined the performance of particular
networks under differing environments [DaS86, DiJ81, KrS83, Law75, Pat81, Pea77],
few have performed comparisons of dissimilar interconnection networks under the
same environment [AbP86, Dal86]. Of the research efforts which have performed
interconnection network comparisons for the same operating conditions, their scope
has been limited in the number of differing networks compared [AbP86, Dal86] and
in the sizes of the switching elements which connect the multiple processors [AbP86].
For this reason, the main objective of this thesis investigation is to expand upon the
performance comparison base by examining three types of interconnection networks:
the single stage cube network, the multistage cube network and the Illiac IV mesh
network. This performance comparison is made for network sizes capable of
supporting 64 to 1024 autonomous processors. This range is chosen due to the technological
limitations which presently exist in supporting complex processors. The physical
structure and interconnection functions of these three interconnection networks are
discussed in detail in Chapter 2.

One of the figures of merit that this investigation concentrates on is the
average delay incurred by a message as it traverses the network. By comparing the
average delay experienced by messages entered into the networks, for various
network loadings, a determination of the desirability of one network over another can be
made. Added information in determining the performance of the three networks is
gained through the knowledge of the maximum queue lengths associated with each
network for a given network loading. This gives the designer insight into the cost
of constructing a network if the average delay is the most important performance
parameter considered.

1.3  Summary 

This chapter has presented an overview of the processing restrictions incurred
by using the traditional von Neumann architecture. Methods for overcoming these
restrictions can be realized by using parallel processing techniques. One of the
underlying problems associated with parallel systems is in the determination of how
the multitude of processors will be connected to one another. This investigation
examines three topologies of interconnection networks.


In  Chapter  2,  an  overview  of  parallel  processing  systems  is  presented.  A  de¬ 
tailed  discussion  is  presented  on  the  classification  of  parallel  systems.  This  discussion 
is  followed  by  an  examination  of  interconnection  networks,  the  characteristics  asso¬ 
ciated  with  message  transmission  and  the  control  methodologies.  A  brief  overview 
of  contemporary  parallel  systems  is  presented  with  relation  to  their  interconnection 
network,  switching  methodology,  and  control  implementations. 

Chapter 3 examines previous performance modeling and analysis research that
has been performed. These studies examine both the analytic and the simulation
modeling of interconnection networks. A discussion of the modeling and analysis of
the switching methodologies of the network is presented, comparing circuit switching
and packet switching. The latter sections of Chapter 3 discuss the present state of
network modeling and analysis. These sections examine the comparison of dissimilar
network topologies.

Chapter 4 presents the methodology applied in this investigation. The
simulation methodology, validation, and performance analysis of the three
interconnection network models are presented in Chapter 5. Conclusions and
recommendations for future research are presented in Chapter 6.


2.  Parallel  Processing  Systems  Overview 


2.1  Introduction 

In this chapter, the underlying characteristics of parallel processing systems
will be reviewed. The basic understanding of these characteristics is essential to
the comprehension of the contemporary concurrent processing systems and the
architectural problems which they face. In Section 2.2, parallel processing systems
classification methodologies will be presented. Section 2.3 will review the major
classes of interconnection networks. Contemporary parallel processing systems will
be discussed in Section 2.4 with relation to the interconnection networks discussed
in Section 2.3.

2.2  Parallel Processing Systems Classification Methodologies

The  ability  to  accurately  classify  computer  systems  at  the  systems  level  is  a 
problem  that  has  plagued  computer  architects  since  the  inception  of  the  von  Neu¬ 
mann  machine.  Three  taxonomies  have  been  recognized  as  viable  tools  for  use  in 
reducing  this  problem. 

2.2.1  Flynn's Taxonomy  In 1966, Michael Flynn [Fly66] proposed a method
for classifying computer systems based on the number of instruction and data streams
associated with the system. The term stream is used to denote the sequence of items
(instructions or data) that are either executed or operated upon by a processor
contained in the machine. From this concept of streams, Flynn proposed that a
machine could be classified into one of four categories. These categories are as
follows:

•  Single  Instruction  stream  -  Single  Data  stream  (SISD) 

•  Single  Instruction  stream  -  Multiple  Data  stream  (SIMD) 
•  Multiple  Instruction  stream  -  Single  Data  stream  (MISD)

•  Multiple  Instruction  stream  -  Multiple  Data  stream  (MIMD) 

Figure  2.1  represents  the  four  categories  of  Flynn's  taxonomy. 

The SISD machine represents the traditional von Neumann architecture. This
machine is characterized by a single processor which uses a single instruction stream
and a single data stream for its operation. The MISD machine is characterized
by multiple instruction streams, supplied by multiple processors, which operate
on a single data stream. In theory, a MISD machine's multiple processors operate
concurrently on a single stream of data. At present, no true MISD machines exist
[HwB84]. Both the SIMD and MIMD machines are classified as parallel processing
systems. A SIMD machine is characterized by a single instruction stream which is
spawned off to multiple processors, each of which retains its own data stream. The
single instruction stream allows for a "lock-step" (sequential) instruction execution.
The MIMD machine is one whose characteristics are truly parallel. An instruction
stream is associated with each of the multiple processors in the system. This allows
for the concurrent operation and execution of instructions.

2.2.2  Feng's Taxonomy  T. Y. Feng's taxonomy [Fen72] attempts to compare
computer systems by computing their degree of parallelism. From the results of
these computations, Feng proposes that the processing power of a system can be
quantified. The degree of parallelism represents the maximum number of bits per
unit time that the system can process. To describe the measure of parallelism, Feng
uses the ordered pair (n, m), where n is the processor word length and m is the system
bit-slice length. Thus, systems can be classified as word serial/parallel (n = 1 or
n > 1) and bit serial/parallel (m = 1 or m > 1). By computing the product of n and
m, the degree of parallelism of the system can be used for performance comparisons
of differing architectures.
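
The product described above can be illustrated with a minimal Python sketch. The class name `FengPair` and the two sample systems are hypothetical and are introduced here only to exercise the arithmetic; they are not taken from [Fen72].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FengPair:
    """Feng's ordered pair (n, m) for one system (hypothetical helper)."""
    n: int  # processor word length
    m: int  # system bit-slice length

    def degree_of_parallelism(self) -> int:
        # Maximum number of bits processed per unit time: the product n * m.
        return self.n * self.m

    def label(self) -> str:
        # Wording follows the text: n = 1 is word serial, n > 1 word parallel;
        # m = 1 is bit serial, m > 1 bit parallel.
        word = "word serial" if self.n == 1 else "word parallel"
        bit = "bit serial" if self.m == 1 else "bit parallel"
        return f"{word}, {bit}"


# Hypothetical systems, chosen only for illustration.
systems = {"A": FengPair(16, 1), "B": FengPair(64, 64)}
for name, pair in systems.items():
    print(name, pair.label(), pair.degree_of_parallelism())
```

Under this measure, system B would rank above system A simply because its product is larger; the sketch makes no claim about any other aspect of performance.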
Figure 2.1.  Flynn's classification taxonomy: (a) SISD; (b) MISD; (c) SIMD


2.2.3  Handler's Taxonomy  The taxonomy of Wolfgang Handler [Han77] presents
another method for determining the classification of a computer system. Handler's
taxonomy is an attempt to classify a system by its degree of parallelism and its
pipelining capabilities. Pipelining is defined as the system's ability to decompose a
process into distinct subprocesses which may be executed in an overlapped manner.
Under Handler's proposal, a system can be represented by a triple, T(C), which
contains six independent variables. T(C) is defined as follows:

$$
T(C) = \langle K \times K',\; D \times D',\; W \times W' \rangle \tag{2.1}
$$

where

K is the number of processor control units (PCU)
K' is the number of PCUs that can be pipelined together
D is the number of arithmetic logic units (ALU) per PCU
D' is the number of ALUs that can be pipelined together
W is the basic wordlength of the ALU
W' is the number of pipeline stages in the ALU

The significance of Handler's taxonomy is that it introduces the concept of pipelining
as a classification measure.

2.3  Interconnection Networks

In a multiprocessor environment, the ability of a particular processor to
communicate with other processors in the system is dependent upon the topology of
the network which connects them and the interprocessor communication switching
methodology. Interconnection networks can range from simple and inexpensive to
complex and cost prohibitive. The simplest (logically) interconnection network
is the ring. The complexity of a ring is proportional to the number of processors
in the ring, O(n). As its name implies, the ring interconnection network forms a
closed loop by connecting neighboring processors in a uni- or bi-directional ring.
The communication time in the ring is a function of the number of processors in the
ring. As a result, systems implemented using the ring interconnection topology are
severely limited in the number of processors which can be connected. Therefore, the
ring interconnection can be feasibly applied only when the number of processors to
be connected is small. On the other end of the complexity-cost spectrum lies the
crossbar switch. The crossbar switch is characterized by n inputs and n outputs.
An n-by-n crossbar switch allows for full connectivity between its n inputs and n
outputs. This capability of routing any input to any output must be paid for in
logic complexity and high cost. The high cost results from a circuit complexity
which is proportional to the square of the number of processors and memory devices
connected to its input/output ports, O(n^2). To overcome the restrictions of small
sized systems inherently related with ring interconnection networks, and cost
prohibitive systems implemented solely with crossbar switches, design compromises
have to be made. As a result of these compromises, two classes of interconnection
networks are being designed and constructed to allow for large numbers of processors
at a reasonable cost. These two interconnection network classes are the direct or
single-stage networks, and the indirect or multistage networks. Direct networks use
point-to-point links to connect the processing elements. Indirect networks, on the
other hand, use the network as a separate entity. The processing elements are
connected to the inputs and the outputs of the network.
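
The two ends of the complexity-cost spectrum can be compared numerically. The following Python sketch is illustrative only: the function names are hypothetical, link and crosspoint counts stand in for "complexity", and the worst-case hop counts are the standard values for a ring rather than figures reported in this investigation.

```python
def ring_links(n: int) -> int:
    # A ring of n processors needs n links, so its complexity grows as O(n).
    return n


def ring_worst_case_hops(n: int, bidirectional: bool = True) -> int:
    # Farthest destination: half way around a bi-directional ring,
    # all the way around (n - 1 links) in a uni-directional ring.
    return n // 2 if bidirectional else n - 1


def crossbar_crosspoints(n: int) -> int:
    # An n-by-n crossbar needs one crosspoint per input/output pair: O(n^2).
    return n * n


for n in (8, 64, 1024):
    print(n, ring_links(n), ring_worst_case_hops(n), crossbar_crosspoints(n))
```

For 1024 processors the ring needs only 1024 links but up to 512 hops, while the crossbar connects any pair in one pass at the price of more than a million crosspoints.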

Four methods of interprocessor communications exist: circuit switching, packet
switching, virtual cut through, and wormhole routing. The first type, circuit
switching, is where a dedicated path is established prior to the transmission of data from
source to destination. In circuit switching, the dedicated path is held until the
transmission of data is complete. The second type of switching is packet switching. Packet
switching is a concept in which messages are broken into submessages (packets) and
the packets, along with their routing information, are allowed to independently
traverse the network from input to output. In virtual cut through, the packet
headers are examined to determine the next appropriate channel for the packet to
be transmitted on. Virtual cut through uses a store and forward method of
transmitting the packets once the packet header has been examined. If a blockage exists,
the packet is buffered until the blockage has been resolved. Wormhole routing uses
the same basic approach of examining the packet header as virtual cut through. The
two methods differ in how the packet is handled when a blockage is encountered.
Instead of buffering the packet as done by virtual cut through, wormhole routing
keeps the packet in the network until the blockage is resolved.

2.3.1  Single-Stage Networks  The single-stage network is considered to be
a dynamic network with a collection of n input selectors and n output selectors.
A dynamic network can be described as a network which has the ability
to reconfigure its interconnection links. The manner in which these links are
reconfigured is dependent upon the implementation of the interconnection function.
Examples of single-stage networks are the Illiac IV [BaB68], the Shuffle-Exchange
[Sto71], the PM2I [Sie85], and the Cube [Sie85]. Subsubsection 2.3.1.1 discusses
the Illiac IV interconnection function and physical layout. The Cube network is
examined in the same light in Subsubsection 2.3.1.2.

2.3.1.1  The Illiac IV Interconnection Network.  The Illiac IV network
received its name from the SIMD machine, the Illiac IV, designed in the late 1960s
and early 1970s [BaB68]. The Illiac IV network has a physical layout which is
approximately equivalent to a two-dimensional mesh. The Illiac IV network differs
from a mesh network in that the border processing elements are connected in a
"wrap-around" fashion. Figure 2.2 shows the physical layout of an Illiac IV network
where the number of processing elements is equal to 16.



The  physical  interconnection  of  the  Illiac  IV  processing  elements  is  based  on 
four  interconnection  functions.  These  functions  are  as  follows: 


$$
\begin{aligned}
\mathrm{Illiac}_{+1}(P) &= (P + 1) \bmod N \\
\mathrm{Illiac}_{-1}(P) &= (P - 1) \bmod N \\
\mathrm{Illiac}_{+n}(P) &= (P + n) \bmod N \\
\mathrm{Illiac}_{-n}(P) &= (P - n) \bmod N
\end{aligned}
$$

where

P is the processor identification number
N is the number of processors in the Illiac IV network
n is $\sqrt{N}$
mod is modulus arithmetic
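
As an illustration, the four interconnection functions can be evaluated directly. The Python sketch below is a minimal example under stated assumptions: the function name `illiac_neighbors` is hypothetical, the fourth function is taken to be (P - n) mod N, and N is required to be a perfect square so that n is an integer.

```python
import math


def illiac_neighbors(p: int, size: int) -> dict[str, int]:
    """Return the four Illiac IV neighbors of processor p in a size-PE network."""
    n = math.isqrt(size)
    if n * n != size:
        raise ValueError("number of processors must be a perfect square")
    return {
        "+1": (p + 1) % size,
        "-1": (p - 1) % size,  # Python's % keeps the result in 0..size-1
        "+n": (p + n) % size,
        "-n": (p - n) % size,
    }


# The 16-PE case of Figure 2.2: PE 0 wraps around to PEs 15 and 12.
print(illiac_neighbors(0, 16))
```

The wrap-around behavior of the border processing elements follows directly from the modulus: PE 0 reaches PE 15 through the -1 function and PE 12 through the -n function.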

2.3.1.2  The Cube Interconnection Network  The Cube network [Sie77]
is a single-stage interconnection network whose name is derived from its processing
elements' physical interconnection pattern. The dimension of the cube is determined
by the number of processing elements in the cube. As an example, let N be the
number of processing elements in the cube. The dimension of the cube, m, is log_2 N.
In an m-dimensional cube, the processing elements are located on the vertices of the
cube. Each processing element is connected to m adjacent processing elements. A
3-dimensional cube is shown in Figure 2.3. The interconnection of the processing
elements in an m-dimensional cube can be described by m interconnection functions.


$$
\mathrm{Cube}_k(p_{m-1} p_{m-2} \ldots p_1 p_0) = p_{m-1} p_{m-2} \ldots \bar{p}_k \ldots p_1 p_0, \qquad 0 \le k < m
$$

where p_i is the i-th bit of a processing element's address and the bar denotes the
complement of bit p_k. The Cube_k function connects a particular processing element,
represented on the left side of the equation, to a processing element given on the
right side of the equation. The two processing elements' addresses differ in the
k-th bit position.

Figure 2.3.  The Cube interconnection network where m = 3, N = 8.
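
Since the Cube_k function complements a single address bit, it can be sketched with an exclusive-or. The Python example below is illustrative; the helper names `cube`, `cube_neighbors`, and `hops` are hypothetical, and the hop count uses the fact that each link changes exactly one address bit.

```python
def cube(k: int, address: int) -> int:
    # Cube_k complements bit k of the address and leaves the other bits alone.
    return address ^ (1 << k)


def cube_neighbors(address: int, m: int) -> list[int]:
    # One neighbor per dimension: the m addresses that differ in exactly one bit.
    return [cube(k, address) for k in range(m)]


def hops(src: int, dst: int) -> int:
    # Minimum number of links between two PEs: the number of differing bits.
    return bin(src ^ dst).count("1")


# The m = 3, N = 8 case of Figure 2.3.
print(cube_neighbors(0, 3))
print(hops(0, 7))
```

For m = 3, processing element 0 is adjacent to elements 1, 2, and 4, and reaching element 7 takes three links, one per dimension.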


2.3.2  Multistage  Networks 

Like the single-stage networks, multistage networks are dynamic networks.
Multistage networks can be described by three characterizing features: the switching
element, the network topology, and the control structure [HwB84]. The switching
element is a device whose function is to interchange its p inputs and p outputs.
As an example, a 2-by-2 switch box has four allowable settings: straight, exchange,
upper broadcast, and lower broadcast, as shown in Figure 2.4.
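
The four settings of a 2-by-2 switch box can be expressed as a small function from the two inputs to the two outputs. This Python sketch is illustrative; the names `Setting` and `switch_box` are hypothetical, and the broadcast settings are assumed to copy the named input to both outputs.

```python
from enum import Enum


class Setting(Enum):
    STRAIGHT = "straight"
    EXCHANGE = "exchange"
    UPPER_BROADCAST = "upper broadcast"
    LOWER_BROADCAST = "lower broadcast"


def switch_box(setting: Setting, upper, lower):
    """Return (upper output, lower output) for the given inputs and setting."""
    if setting is Setting.STRAIGHT:
        return upper, lower   # each input passes to the output on its own side
    if setting is Setting.EXCHANGE:
        return lower, upper   # the two inputs swap sides
    if setting is Setting.UPPER_BROADCAST:
        return upper, upper   # upper input is copied to both outputs
    return lower, lower       # lower broadcast: lower input copied to both


for s in Setting:
    print(s.value, switch_box(s, "a", "b"))
```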

Multistage  networks  can  be  one-  or  two-sided.  One-sided  networks  are  those 
whose  input  and  output  ports  are  on  the  same  side  of  the  network.  Two-sided 
networks  have  inputs  on  one  side  of  the  network  and  outputs  on  the  other  side. 
Two-sided multistage networks can be categorized into three classes: blocking,
rearrangeable, and nonblocking [Fen81].

Blocking networks are those in which the simultaneous connection of more than
one terminal pair may cause conflicts in the allocation of the remaining
communication links. Examples of blocking networks are the Data Manipulator [Fen81], the
Omega [Law75], and Indirect Binary n-cube [Pea77].

Rearrangeable  networks  are  those  which  can  perform  all  possible  combinations 
of  connections  between  inputs  and  outputs  by  rearranging  existing  connections  to 
allow  for  new  input-output  connections.  The  Benes  network  [Ben65]  is  an  example 
of  a  rearrangeable  network. 

The third class of two-sided multistage networks is the nonblocking network.
In a nonblocking network, there exists a one-to-one connection between input and
output port. The crossbar switch, which provides full connectivity between inputs
and outputs, is an example of a nonblocking network. A multistage network with N
processing elements will contain at least log_p N stages, where p is the size of the
crossbar switching box. Each stage of the network will consist of N/p switching
boxes.
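
The stage and box counts above can be tabulated for the network sizes considered in this investigation. The Python sketch below is a minimal example: the function names are hypothetical, it computes the minimum of log_p N stages, and it assumes N is an exact power of p (other sizes are rejected rather than padded).

```python
def stage_count(n_pe: int, p: int) -> int:
    """Return log_p(n_pe) using integer arithmetic only."""
    count, reach = 0, 1
    while reach < n_pe:
        reach *= p
        count += 1
    if reach != n_pe:
        raise ValueError("n_pe must be an exact power of p")
    return count


def network_size(n_pe: int, p: int) -> tuple[int, int, int]:
    """Return (stages, boxes per stage, total boxes) for p-by-p switching boxes."""
    stages = stage_count(n_pe, p)
    boxes_per_stage = n_pe // p
    return stages, boxes_per_stage, stages * boxes_per_stage


for n_pe, p in ((8, 2), (64, 2), (64, 4), (1024, 2), (1024, 4)):
    print(n_pe, p, network_size(n_pe, p))
```

With 2-by-2 boxes a 1024-element network needs 10 stages of 512 boxes, whereas 4-by-4 boxes reduce this to 5 stages of 256 boxes.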

A third feature used in characterizing multistage networks (the switching
elements and the network topology being the first two) is the control structure. The
control structure of the network determines how the switching elements are to be
controlled. There exist two basic methods for implementing the control structure:
distributed or centralized control. In a distributed control structure, each switching
box contains control logic which uses routing information contained in the header of
a message to determine the setting of the switching box. Centralized control uses
a centrally located control unit to inform individual switching boxes of their
routing settings. Implementation tradeoffs must be considered when choosing the control
structure for a multistage network. While the advantages of constant path set-up and
simple interchange box logic make centralized control seem preferable to
distributed control, the disadvantages of centralized control far outweigh its advantages over
distributed control. The major disadvantage of centralized control is that only one
message can be routed at any instant of time, thereby serializing the network
accesses. Using distributed control, multiple messages can be routed simultaneously,
yielding no bottleneck effects.

Figure 2.4.  The four settings of a 2-by-2 switch box.

2.3.1.3  The Multistage Cube Interconnection Network.  The multistage
cube network [McS81, Sie85] is based on the Cube interconnection function presented
in Section 2.3.1.2. Its topology is equivalent to the blocking networks characterized
above. The multistage cube network consists of log_p N stages, where N is the number
of processing elements in the system and p is the size of the crossbar switching
element. Each stage in the network contains N/p switching elements. Each stage
of the multistage cube implements one Cube function: the boxes of stage i
implement the Cube_i function. At stage i, the address lines that differ in the
i-th bit position are paired at the switching elements. Figure 2.5 shows the multistage
cube network for N = 8 implemented with 2-by-2 crossbar switching elements.
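
The pairing of address lines at each stage suggests a simple way to trace a message through the network. The Python sketch below is an illustration under stated assumptions, not the simulation model of this investigation: it assumes 2-by-2 boxes, that stages are traversed from bit m-1 down to bit 0, and that each box sends a message to the output line whose stage bit equals the corresponding destination bit. The function names are hypothetical.

```python
def route(src: int, dst: int, m: int) -> list[int]:
    """Return the line label occupied before stage m-1 and after every stage."""
    line, path = src, [src]
    for i in reversed(range(m)):          # assumed stage order: m-1 down to 0
        bit = (dst >> i) & 1
        # Leave the box on the line whose bit i matches the destination's bit i.
        line = (line & ~(1 << i)) | (bit << i)
        path.append(line)
    return path


def conflict(msg_a: tuple[int, int], msg_b: tuple[int, int], m: int) -> bool:
    """True if two (src, dst) messages need the same line after some stage."""
    path_a, path_b = route(*msg_a, m), route(*msg_b, m)
    return any(a == b for a, b in zip(path_a[1:], path_b[1:]))


# N = 8, m = 3, as in Figure 2.5.
print(route(5, 2, 3))
print(conflict((0, 0), (4, 1), 3))
print(conflict((0, 0), (1, 1), 3))
```

Under these assumptions the message from 5 to 2 occupies lines 5, 1, 3, and 2 in turn. The pair 0 to 0 and 4 to 1 collides at the first stage even though the destinations differ, which is the blocking behavior noted above, while 0 to 0 and 1 to 1 pass without conflict.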

2.4  Parallel Processing Systems

Contemporary parallel processing systems have been implemented to take
advantage of the architectural advances made in the design of interconnection
networks. This section reviews five major systems which have either been implemented
commercially or have been built solely for the purpose of research of the present
technology.


From a network point of view, parallel processing systems can be grouped
into one of two architectural categories: processor-to-memory (P-M) or processing
element-to-processing element (PE-PE, where a PE is a processor-memory pair).
Processor-to-memory architectures use bi-directional networks to connect processors
to memory modules. Processor-to-memory architectures are characterized by heavy
network loading which results from inter-processor communications and memory
accesses across the network. In a PE-to-PE architecture, the network is unidirectional
and provides inter-PE communications only. The PE-to-PE architecture differs from
the P-M architecture in that no commonly accessible memory modules exist. As a
result, the network loading is less in a PE-to-PE system than in a P-M system.
Figures 2.6 and 2.7 show the PE-to-PE and P-M architectures respectively.

2.4.1  The Illiac-IV  The Illiac-IV, a SIMD machine, was developed in a joint
effort between the University of Illinois and the Burroughs Corporation [Sto77].
Proposed in 1965 and shipped in 1972, the Illiac-IV was one of the first machines to
be implemented using a parallel architecture. Original proposals were for the
Illiac-IV to be a multi-SIMD machine with four quadrants, each of which would contain
64 processing elements. Only one quadrant was ever constructed. The Illiac-IV
was primarily designed to solve partial differential equations and perform matrix
multiplication. The interconnection network of the Illiac-IV was a variation of the
mesh interconnection network.

2.4.2 The BBN Butterfly. The BBN Butterfly is another parallel machine
whose interconnection network implementation is the multistage cube. Manufactured
by Bolt, Beranek, and Newman, Inc., the Butterfly is designed for commercial
time-sharing use as well as for image processing in a research environment [CrG85].

The Butterfly is designed to house up to 256 independent processors. At
present, the machine has been commercially packaged to contain from 1 to 128
processors. As mentioned above, the interconnection network is a multistage
implementation. The interchange boxes are 4-by-4 crossbar switches. Packet switching
and distributed routing control are used for message transmission. The Butterfly's
system architecture is a PE-to-PE architecture.

2.4.3 The NYU Ultracomputer. The Ultracomputer is a shared-memory MIMD
machine which is presently under development at New York University [GoG83].
This machine, when fully implemented, will house 4096 autonomous processors for
use as a general purpose parallel system. At present, an Ultracomputer prototype
containing 64 processors has been built. The Ultracomputer uses a P-M system
architecture.

The interconnection network of the Ultracomputer is a multistage cube which
uses 4-by-4 crossbar switches as its switching elements. The switching methodology
used by the Ultracomputer is packet switching, with packets traversing the network
by the destination address routing method.

2.4.4 The Intel iPSC. The Intel Personal Super Computer (iPSC) is a
research-oriented MIMD machine. The iPSC architecture is more commonly referred to
as the hypercube or binary n-cube, based upon its interconnection network, a packet
switched implementation of a single-stage network. The hypercube may consist of
one, two, or four 32-node computational units. Each of the cube's 32 processing
nodes can function independently and concurrently with one another. A central
controller (cube manager) serially passes data and process code to active nodes
within the cube.

2.4.5 The IBM Research Parallel Processor Prototype (RP3). The RP3 is a
MIMD machine designed strictly to research the hardware and software aspects of
parallel processing [PfB85]. Incorporating much of the NYU Ultracomputer design,
the RP3 has been designed and is being constructed at the IBM T.J. Watson
Research Center. The RP3 will contain 512 32-bit microprocessors. These 512
processors will be grouped into eight modules, with each module containing 64
processors.

The interconnection network of the RP3 consists of two separate networks: a
multistage cube and a combining network which is used for interprocessor
coordination functions. As with the Ultracomputer and the Butterfly, the RP3's
interchange boxes are implemented using 4-by-4 crossbar switches. The interprocessor
communications are a mixture of circuit- and packet-switching.

2.5 Summary

In this chapter, an overview of parallel processing systems has been presented.
Three methods for classifying parallel processing systems were discussed. While
Flynn's taxonomy is the most widely recognized of the three methodologies, it fails
to provide architectural details of a system. Both Feng and Handler provide limited
architectural insight, but fail to provide enough information to accurately describe
a system.

Three interconnection networks were discussed from a functional implementation
point of view. Interconnection network communication switching methodologies
were also defined. Each network's interconnection function was presented along with
its physical layout. Multistage networks were discussed along with a presentation of
their associated interchange boxes, topology and control structures.

Five parallel processing systems were briefly examined from a networks point
of view. These systems revealed the progression of parallel systems architectural
implementations. This progression began with the first implemented parallel
machine, the Illiac IV, and has proceeded to present-day systems such as the NYU
Ultracomputer, Intel's iPSC, and IBM's RP3.


3.  Performance  Modeling  and  Analysis 


3.1  Introduction 

Over the past six years, extensive research has examined interconnection
networks from a performance modeling and analysis viewpoint. These performance
studies have ranged from analytically modeling the probability of message collisions
in crossbar switches, to determining which of the two switching methodologies is
best suited for interprocessor communications. Further studies have examined the
possibilities of modeling interconnection networks via computer simulations.
Interconnection network comparison studies have proven valuable in helping system
architects choose the network which will best suit the application. The following
sections review previous research performed in performance modeling and analysis.

3.2 Crossbar Switch Analysis

The crossbar switch, defined and discussed in Chapters 1 and 2, is used as
the basic switching element in many multiprocessor designs and implementations
[AdSM, CrG85, Dav85, DaS85, DaS86, DiJ81, GoG83, Law75, McA81, Pea77,
PfB85, Sie85, SiH86]. This section presents a review of the research performed
by Patel [Pat81]. Using analytic and simulation techniques, Patel compares the
Delta networks, a permutation of the multistage cube network [Sie85], and networks
comprised solely of crossbar switches. This review of Patel's work is concerned only
with the implementation issues and the analysis associated with the use of
a crossbar switch as a network switching element. For consistency, the notation of
Patel is used in the following discussion.

From a hardware point of view, the crossbar switch consists of two major
components: the control logic and the switching element itself. The control logic is
used to process message requests and to provide arbitration among requestors in the
event of a conflict. The function of the switching element is to route the information
or data once the setting of the switch has been determined by the control logic. A
functional block diagram of a 2-by-2 crossbar switch is shown in Figure 3.1.

The single lines in Figure 3.1 represent one-bit lines. The double lines into and
out of the INFO box represent address lines, data lines, and a Read/Write control
line. The $X$ and $\bar{X}$ lines are used to control the switch setting. If the input
$X$ is logic 1, the switch is set to its cross connection. If $X$ is logic 0, the switch
is set to its straight connection (see Figure 2.4). For the 2-by-2 case, only one bit has
to be examined to determine the setting of the switch. Switches of size N require N - 1
bits to be examined to properly set the crossbar connections. In the implementation
of a crossbar switch, two sets of control lines exist: the request, the destination, and
the busy lines for the left or input side of the switch, and the request and the busy
lines for the right or output side of the switch. For the 2-by-2 crossbar, two
sets of these control lines exist, one set for each of the inputs and a set for the output
lines of the switch. An N-by-N crossbar switch requires N sets of control lines. The
logic equations for the signals in Figure 3.1 are given below.


$$
\begin{aligned}
X &= r_0 d_0 + \bar{r}_0 \bar{d}_1 \\
\bar{X} &= r_0 \bar{d}_0 + \bar{r}_0 d_1 \\
R_0 &= r_0 \bar{d}_0 + r_1 \bar{d}_1 \\
R_1 &= r_0 d_0 + r_1 d_1 \\
b_0 &= \bar{X} B_0 + X B_1 \\
b_1 &= X B_0 + \bar{X} B_1 + r_0 d_0 d_1 + r_0 \bar{d}_0 \bar{d}_1 \\
O_0 &= I_0 \bar{X} + I_1 X \\
O_1 &= I_0 X + I_1 \bar{X}
\end{aligned}
\qquad (3.1)
$$

The above equations represent the logic equations for the simplest crossbar, the
2-by-2. For large N, the logic equations for an N-by-N crossbar become complex to
the point of intractability.
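
The control logic of the 2-by-2 crossbar can be exercised directly in software. The following is a minimal sketch, assuming the signal conventions stated in its comments: destination bit 0 selects the upper output, X = 1 selects the cross connection, and input 0 has priority when both inputs request the same output. The function name `crossbar_2x2` and the example data values are hypothetical and serve only as illustration.

```python
from itertools import product


def _not(bit):
    """Logical complement of a single 0/1 bit."""
    return 1 - bit


def crossbar_2x2(r0, r1, d0, d1, B0=0, B1=0, I0="a", I1="b"):
    """Evaluate the control signals of a 2-by-2 crossbar for one cycle.

    r0, r1 : request bits on inputs 0 and 1
    d0, d1 : destination bits (assumed: 0 = upper output, 1 = lower output)
    B0, B1 : busy bits returned from outputs 0 and 1
    I0, I1 : information (address/data) presented on inputs 0 and 1
    """
    X = (r0 & d0) | (_not(r0) & _not(d1))        # 1 = cross, 0 = straight
    R0 = (r0 & _not(d0)) | (r1 & _not(d1))       # request forwarded on output 0
    R1 = (r0 & d0) | (r1 & d1)                   # request forwarded on output 1
    b0 = (_not(X) & B0) | (X & B1)               # busy returned to input 0
    # Input 1 also sees busy when input 0 requests the same output.
    b1 = (X & B0) | (_not(X) & B1) | (r0 & d0 & d1) | (r0 & _not(d0) & _not(d1))
    O0 = I1 if X else I0                         # information routed to output 0
    O1 = I0 if X else I1                         # information routed to output 1
    return {"X": X, "R0": R0, "R1": R1, "b0": b0, "b1": b1, "O0": O0, "O1": O1}


if __name__ == "__main__":
    # List the switch setting for every request/destination combination.
    for r0, r1, d0, d1 in product((0, 1), repeat=4):
        s = crossbar_2x2(r0, r1, d0, d1)
        setting = "cross" if s["X"] else "straight"
        print(f"r0={r0} r1={r1} d0={d0} d1={d1} -> {setting}, b1={s['b1']}")
```

Enumerating all sixteen input combinations in this way is practical only for the 2-by-2 case, since the number of combinations grows rapidly with the switch size.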

In the analysis of the crossbar, a crossbar size of M-by-N is assumed. Patel's
analysis assumes a processor-to-memory system architecture (see Figure 2.7).
Using this assumption, the crossbar supports M processors and N memory modules.
Conflicts caused by two requests made for the same memory module are
considered memory conflicts rather than network conflicts. Further assumptions
are made to facilitate the analysis. First, each processor generates requests
randomly and independently of the others, with the requests uniformly distributed over
the memory modules. Second, in each cycle, each processor generates a new request
with probability m. Using these assumptions, the bandwidth (BW), in packets per
unit time, of the crossbar and the probability of message acceptance at the
destination memory module can be derived for crossbar switches of varying size.
For large M and N, the bandwidth and probability of message acceptance are
approximated by:


$$
BW \approx N \left( 1 - e^{-mM/N} \right) \qquad (3.2)
$$

$$
P_A \approx \frac{N}{mM} \left( 1 - e^{-mM/N} \right) \qquad (3.3)
$$

The complete derivation of the above equations can be found in [Pat81]. The above
approximations provide 1% accuracy when M and N are greater than 30, and 5%
accuracy for $M, N \ge 8$.
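
The quality of the large-size approximation can be checked numerically. The sketch below assumes the exact crossbar bandwidth under the stated assumptions is N - N(1 - m/N)^M, the expected number of memory modules that receive at least one request in a cycle, and compares it with the exponential approximation. The function names are illustrative and are not taken from Patel's work.

```python
import math


def bandwidth_exact(M, N, m):
    """Expected number of memory modules receiving at least one request per cycle."""
    return N * (1.0 - (1.0 - m / N) ** M)


def bandwidth_approx(M, N, m):
    """Large-M, large-N approximation: N * (1 - exp(-m * M / N))."""
    return N * (1.0 - math.exp(-m * M / N))


def acceptance_probability(bw, M, m):
    """Fraction of the m * M requests issued per cycle that are accepted."""
    return bw / (m * M)


if __name__ == "__main__":
    m = 1.0  # assumed load: every processor issues a request in every cycle
    for size in (8, 32, 128):
        exact = bandwidth_exact(size, size, m)
        approx = bandwidth_approx(size, size, m)
        rel_err = abs(exact - approx) / exact
        p_a = acceptance_probability(exact, size, m)
        print(f"{size}x{size}: exact={exact:.3f} approx={approx:.3f} "
              f"P_A={p_a:.3f} relative error={rel_err:.2%}")
```

For square crossbars at full load, the relative error of the approximation shrinks as the switch grows, which is the behaviour the accuracy figures above describe.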


3.3 Circuit Switching

Of the four types of switching methodologies presented in Chapters 1 and
2, two have been predominantly used in interconnection network analysis: circuit
switching and packet switching. In this section and the one to follow, a review and
analysis of previous research in these two areas is presented.

In a circuit switched network implementation, a physical link between source
and destination PEs is established prior to message transmission. The path
establishment process is performed using a "request-grant" protocol [McS80]. In a
distributed control system, message requests are forwarded through the network,
setting the switching elements to the appropriate position if available. Once the
transmission path is established, it is held until the transmission of the message(s)
is complete. In cases where two or more different messages desire the same
communication path, or a blockage occurs due to a previously established path, a conflict
among the message transmissions arises. In the event of a conflict, a method for
resolving the conflict must be chosen.

Two of the more common conflict resolution methods, the drop and the hold
algorithms, have been the topics of recent research [Dav85, LeW84]. When a conflict
is encountered using the drop algorithm, the message request is removed from the
network and the partial communication link, up to and including the point of the
conflict, is relinquished. Message requests that are dropped from the network must be
reinitiated by the originator at a later time. The hold algorithm differs from the drop
algorithm in that when a conflict arises, the message request is held at the point of
conflict until the conflict has been resolved. The partial communication link is held
intact.

The works of [Che82, ChL83, Dav85, LeW84] each used discrete-time Markov
chains to aid in the analysis of circuit switched networks implementing either
the drop or the hold resolution algorithm. Modeling a system using Markov chains
can provide a graphical representation of the operating states of the network as well
as the state transition probabilities.

3.3.1 The Hold Conflict Resolution Algorithm. This section presents the
results of the mathematical derivations performed by [Dav85, LeW84] in determining
the state transition equations for the hold conflict resolution algorithm. For
consistency in the presentation of these results, the notation of [Dav85] is used.

Using the Markov chain shown in Figure 3.2, the hold conflict resolution
algorithm can be modeled for a 4-stage network. The derivations of the state transition
probabilities assume a 2-by-2 switching element is implemented in the network.
The network can be modeled by using two sets of states: the request states, $R_i$, and
the blocked states, $B_i$, $0 \le i < n$. These states represent the possible states that a
message request may encounter as it attempts to traverse the network and establish
a path. State P represents the processing state of the message at a local PE. State
T represents the state in which the data transfer may begin. This state signifies the
establishment of the communications link between source and destination. Assuming
a message generation rate, m, from each source PE, and a time delay, d, associated
with the state T, the state transition probabilities are given below in Equation 3.4.


$$
\begin{aligned}
q(T, P) &= 1/d \\
q(T, T) &= 1 - 1/d \\
q(P, R_0) &= m \\
q(P, P) &= 1 - m
\end{aligned}
\qquad (3.4)
$$
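
A short worked step, which is not part of the cited derivations, may help to interpret the delay d. Because a message leaves state T with probability 1/d in each cycle, the number of cycles K spent in T is geometrically distributed, and its mean is d:

```latex
\begin{align*}
\Pr[K = k] &= \left(1 - \tfrac{1}{d}\right)^{k-1} \tfrac{1}{d}, \qquad k = 1, 2, \ldots \\
E[K] &= \sum_{k=1}^{\infty} k \left(1 - \tfrac{1}{d}\right)^{k-1} \tfrac{1}{d}
      = \frac{1/d}{\left(1 - \left(1 - 1/d\right)\right)^{2}} = d .
\end{align*}
```

The same argument applied to state P gives a mean of 1/m cycles between message generations at a PE.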


In deriving the transition probability of moving from state $R_i$ to state $R_{i+1}$,
the three possible causes for blockage in a 2-by-2 switching element were summed.
The resulting probability of transitioning from one stage to the next is given by:

$$
q(R_i, R_{i+1}) = 1 - .25\,p(R_i) - p(B_i) - .5 \sum_{j=i+1}^{n-1} \left( p(R_j) + p(B_j) \right) - .5\,p(T) \qquad (3.5)
$$

The transition from state $R_i$ to state $B_i$ was defined as follows:

$$
q(R_i, B_i^j) =
\begin{cases}
.25\,p(R_i) + p(B_i) & j = i \\
.5 \left( p(R_j) + p(B_j) \right) & i < j < n \\
.5\,p(T) & j = n
\end{cases}
\qquad (3.6)
$$

In Equation 3.6, the superscript notation, $B_i^j$, is used to indicate that the
blocking message is in state j. The superscript changes as the blocking message
moves from state to state.


The probability of transitioning from one stage to the next stage in a particular
time cycle t was defined to be:

$$
q_{j,t} = \frac{p_{t+1}(R_{j+1})}{p_t(R_j) + p_t(B_j)} \qquad (3.7)
$$

Equation 3.7 shows that the probability of transitioning through stage j is the
probability of being in state $R_{j+1}$ in the next cycle divided by the probability of
being in stage j, either requesting or blocked, in the present cycle. Using
this time-based probability, $q_{j,t}$, the transition probabilities for state $B_i^j$ were
shown to be:

$$
q(B_i^j, R_{i+1}) =
\begin{cases}
1/d & j = n \\
0 & j < n
\end{cases}
\qquad (3.8)
$$


q{B^,.Br') 


<ij  i  <  J  <  " 

1  -  <Ij+^  J  =  > 


I  3, '1 1 


0  !  <  j  <  U 

+  \  J  =  I

3.3.2 The Drop Conflict Resolution Algorithm. Using an approach similar to
their analysis of the hold conflict resolution algorithm, Lee and Wu [LeW84] present
a two dimensional structure which models the message request transitions through
the network when the drop conflict resolution algorithm is used. The Markov chain
representation for a 4-stage network is shown in Figure 3.3. As pointed out in
[Dav85], this representation is an approximation to the actual system operation in
that multiple blocking messages are not addressed.

The states P and T of Figure 3.3 represent the same states as those in the hold
algorithm analysis: the local processing at the PE and the message transmission.
The $R_{i,j}$ states represent the traversal of the message requests through the network.
The $R_{i,j}$ states also encompass the record keeping of the location of a blocking
message request, which in expanded notation is represented by a third subscript, k. The
first row of the model represents the first transmission attempt of a message request
entering the network. Subsequent rows depict the retransmission of the message
request following blockage in the previous row. Column states are used to model the
message requests at a particular stage of the network.

As with the hold algorithm, the transition probabilities between the message
processing state, P, the transmission state, T, and the first path request state are
the same:

$$
\begin{aligned}
q(T, P) &= 1/d \\
q(T, T) &= 1 - 1/d \\
q(P, R_{0,0}) &= m \\
q(P, P) &= 1 - m
\end{aligned}
\qquad (3.11)
$$


Letting T be denoted by stage n, the probability of being in column j is denoted $S_j$.

Similar to the derivations in the hold algorithm, the transition probabilities through
the first row are found to be:

$$
q(R_{0,j}, R_{0,j+1}) = 1 - .25\,S_j - .5 \sum_{m=j+1}^{n-1} S_m - .5\,p(T) \qquad (3.13)
$$

When a blockage occurs, the message request is dropped from the network and
resubmitted in the row that corresponds to the column where the blockage occurred.
The transition probabilities are represented by:

$$
q(R_{0,j}, R_{j+1,0,k}) =
\begin{cases}
.25\,S_j & k = j \\
.5\,S_k & j < k < n \\
.5\,p(T) & k = n
\end{cases}
\qquad (3.14)
$$

Equations 3.13 and 3.14 represent all possible transitions in the network for
message requests that are independent of other requests that may be in the network.
Transitions that occur when a previously blocked message request returns to the
column of its initial blockage are considered to be dependent transitions [Dav85]. To
facilitate the analysis of these dependent transitions, the probability of transitioning
through a column j in one time cycle is defined as:


From Equation 3.15, $q_j$ is the probability that the message request did not complete
its transmission in the present time cycle, meaning that the message is still active in
the system. The probabilities are shown to be:

$$
q_j = 1 - 1/d, \qquad n \le j
$$

Using the representation for $Q_j$, the probability of the blocking request being active
in the network after the present time cycle is:


k  +  j 

m  =  ic 

$Q_{k,j}$ can be described as the probability of the blocking message transitioning from
column k to column k+j while the blocked message is resubmitted and returned to the
point of blockage [Dav85]. Using these derivations, the transitions for the dependent
states can be derived. First, Davis derives the transition probabilities of not
being blocked and transitioning through the column. These probabilities are found
to be:


I  { 1  -  Qk, )( 1  -  .25>',  -  ..5  e:::= )  a-  ^ ./ 

[  ( 1  -  1  -  .25b-  -  .5 )  k  =  j 


The transitions resulting from the repeated blockage due to the blocking message
and the blockage by a new request are given in Equations 3.19 and 3.20.


<■/(  1  = 

HI  — 

inni{k'  -I-  J.  r, ). 

1  Qm 

1.;  1  = 

iutu{k-  u-  j_  u 

=  ./ 

.2.5(1 

w  =  J. 

J 

Rj+\.Q.n]  =  < 

•5(  1  - 

Qk.j)Srr. 

J  <  rn  <  ri. 

■  ^  j 

.25(  1 

—  Qk-n.j-\ 

)Sj  rv  =  J. 

f^-  =  j 

.5(  1  - 

Qk,^,^^): 

''  t,,  j  <  m  <  71. 

^  ^■=j

And finally, blockage transitions in the input stage were derived in [LeW84] and are
given below.


•yl'ftl.O.n-  =  1  ~  (3.21) 

~  Qk  (3.22) 

<?(  fli.o.*- ^i.o.o)  =  --^(l  ■“•?*.-)  (3.23i 

q{Ri.o.k.Rui)  =  M^-Qk)  (3.24) 

?( ^1.0.0- ^I’l.o.i )  =  1  (3.2.>  ! 


3.4 Packet Switching Analysis

Packet switched network performance and analysis has been the topic of
extensive research in recent years [DiJ81a, DiJ81b, MuM82, Che82, KrS83, ChH84,
DaS86]. In this section, the basic principles of packet switched networks are
examined along with the techniques used to model this type of network implementation.
Results from previous research will also be discussed.

Packet switched networks differ from circuit switched networks in the manner in
which the communication link between a source and destination pair is maintained.
While the complete link is held until a message has completed its traversal of the
network in a circuit switched environment, the link between switching elements is
held just long enough for the message to traverse the link in a packet switched
network. This eliminates the need for a complete path from source to destination
prior to the message transmission.

Advantages and disadvantages exist when using packet switched networks
instead of circuit switched implementations. An advantage of packet switched
networks is the ability to pipeline the packet transmissions, thereby potentially reducing
the overall transmission delays and increasing the network throughput. With these
benefits come potential drawbacks. Since each packet establishes
its own path through the network, the path establishment delays increase as
the number of packets in a message grows. Additionally, the logic required to
control the switching elements is more complex than that of a comparable network
constructed using the circuit switched methodology.

As in the circuit switched network, conflict resolution algorithms exist in packet
switched networks. When a conflict occurs at a switching element, one of three
resolution algorithms can be used: the hold, the drop, and the reroute algorithms.
The hold and the drop algorithms are the same as those discussed in the previous
section. When the reroute algorithm is used, a blocked message is rerouted to an
incorrect destination for resubmission by that destination to the correct destination.

Methods for modeling packet switched networks which have used both discrete
and continuous time Markov chains have been presented in [DiJ81a, DiJ81b, Che82,
KrS83, ChH84, Dav85]. Queueing models consisting of n nodes which represent the
n stages of the network can be used to analyze the packet traversal of the network.
Switching elements are represented by queues at the inputs of the n nodes. Queue
lengths are assumed to be of a finite length determined prior to implementation.

Dias and Jump [DiJ81a, DiJ81b] investigated the effects on the network due
to varying buffer sizes. By doubling the buffer size from one to two packets, they
were able to show that the throughput of the network is also increased. The increase
in throughput continues, to a point, as the size of the buffers is increased. The
throughput remains approximately constant for buffer sizes in excess of
4-6 packets. It is also shown that packet delays increase as the buffer sizes increase.
The bottom line of the research of Dias and Jump is that an optimal buffer
size exists to maximize the network throughput. Additional research performed on
multiple-packet networks, by [DaS86], is presented in the sections which follow.
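
The qualitative trend reported above can be illustrated with a much simpler model than the one Dias and Jump analyzed. The sketch below is a stand-in under stated assumptions, not their method: it treats one switch input as a single M/M/1/K queue (Poisson arrivals at rate `lam`, exponential service at rate `mu`, room for `K` packets including the one in service) and evaluates the textbook closed-form results. The function name and the parameter values are hypothetical.

```python
def mm1k_metrics(lam, mu, K):
    """Return (throughput, mean_delay) of an M/M/1/K queue.

    lam : packet arrival rate
    mu  : packet service rate
    K   : total capacity in packets, including the packet in service
    """
    rho = lam / mu
    if abs(rho - 1.0) < 1e-12:
        # Limiting case rho == 1: all K + 1 occupancy states are equally likely.
        p_block = 1.0 / (K + 1)
        mean_in_system = K / 2.0
    else:
        p_block = (1.0 - rho) * rho ** K / (1.0 - rho ** (K + 1))
        mean_in_system = (rho / (1.0 - rho)
                          - (K + 1) * rho ** (K + 1) / (1.0 - rho ** (K + 1)))
    throughput = lam * (1.0 - p_block)        # accepted packets per unit time
    mean_delay = mean_in_system / throughput  # Little's law on accepted traffic
    return throughput, mean_delay


if __name__ == "__main__":
    lam, mu = 0.9, 1.0  # assumed offered load of 0.9 packets per service time
    for K in range(1, 9):
        thr, delay = mm1k_metrics(lam, mu, K)
        print(f"K={K}  throughput={thr:.3f}  mean delay={delay:.3f}")
```

In this simplified model each added buffer raises the throughput by a smaller amount than the previous one, while the mean delay keeps growing, which mirrors the tradeoff described above.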


3.5  Tradeoff  Analysis  of  Switching  Methodologies 

In this section, a tradeoff analysis of circuit-switched versus packet-switched
multistage generalized cube networks is reviewed. Two operating modes are presented
in the original analysis: the SIMD mode and the MIMD mode. This review is only
concerned with the MIMD mode of operation due to its relationship to this thesis
investigation.

Davis and Siegel [DaS86] perform a comparative study into the effects of
choosing packet-switching versus circuit-switching as the switching methodology for the
multistage cube network. The effects of multiple-packet messages on the network
are also researched. Results from this research are detailed in the text which follows.

In a MIMD environment, the generation and transmission of messages to and
through the network occur asynchronously. Messages generated consist of a header,
containing the routing information, and one or more data words. If the size of the
message exceeds the maximum single-packet size allowed by the network, the message
must be broken down into multiple packets for transmission through the network.
Each packet contains the same routing information. In the multistage cube network,
a single path exists between a source and destination pair [McS81]. When multiple
packets occur, those packets must be routed sequentially to ensure proper ordering
at the destination.
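
The message decomposition described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `Packet` record and a hypothetical limit `max_words` on the number of data words per packet; neither name comes from the work of Davis and Siegel. The order of the returned list stands for the order in which the packets are sent along the single path.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Packet:
    destination: int          # routing information, identical in every packet
    data: Tuple[int, ...]     # data words carried by this packet


def packetize(destination: int, words: List[int], max_words: int) -> List[Packet]:
    """Split a message into packets of at most max_words data words each."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    return [
        Packet(destination, tuple(words[start:start + max_words]))
        for start in range(0, len(words), max_words)
    ]


if __name__ == "__main__":
    # A 10-word message for PE 5, with an assumed limit of 4 data words per packet.
    for packet in packetize(5, list(range(10)), 4):
        print(packet)
```

Because every packet carries the same destination and follows the same path, sending the packets in list order is sufficient to preserve the word order at the destination in this sketch.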

Davis and Siegel, in analyzing the performance of multiple-packet messages,
introduce two parameters to aid the analysis. The packet cycle time is defined as
the time delay associated with the packet moving from an input to an output of a
network interchange box. The packet offset time is considered to be the time between
successive packet generations in a multiple-packet message. This time can be further
described as the time difference in the speed of the system PEs and the network. As
an example, the packet offset time is equal to one when the time to generate a packet
is equivalent to the time required by an interchange box to process the packet.


Results from their study show that, for a given message size, the delay of
a packet in the network decreases as the packet offset increases. This is to be
expected, as an increase in the packet offset reduces the apparent network loading
and, subsequently, the network conflicts. These values can be compared against
the ideal times required by a packet to traverse the network. Under ideal conditions,
an m-packet message requires k + (m - 1) packet cycles to traverse the k-stage network.
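
The ideal traversal time follows from pipelining: the first packet needs k packet cycles, and each later packet leaves the network one cycle after its predecessor. The sketch below checks the closed form against a step-by-step pipeline model. It assumes no conflicts, a packet offset of one, and one packet cycle per interchange box; the function names are illustrative.

```python
def ideal_traversal_cycles(k, m):
    """Closed form: k cycles for the first packet, one more for each follower."""
    return k + (m - 1)


def simulate_pipeline(k, m):
    """Count packet cycles until the last of m packets leaves a k-stage network."""
    stages = [None] * k      # stages[s] holds the packet processed by stage s
    next_packet = 0          # index of the next packet to inject
    delivered = 0
    cycles = 0
    while delivered < m:
        cycles += 1
        entering = next_packet if next_packet < m else None
        if entering is not None:
            next_packet += 1
        # Every packet advances one stage; the packet that was in the last
        # stage has already been delivered and drops off the end.
        stages = [entering] + stages[:-1]
        if stages[-1] is not None:
            delivered += 1   # packet finishes the last stage in this cycle
    return cycles


if __name__ == "__main__":
    for k, m in [(4, 1), (4, 3), (6, 8)]:
        print(k, m, ideal_traversal_cycles(k, m), simulate_pipeline(k, m))
```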

When making the choice of packet-switching or circuit-switching for network
implementation, different factors must be considered. First, the operational mode
(SIMD or MIMD) will determine the effects on the network due to conflicts among
the messages and the associated queueing delays. A second issue is the type of
system architecture that supports the network. The two types, PE-to-PE and P-M
system architectures, discussed in Chapter 2, determine whether the network will be
characterized by light loading and low conflicts (PE-to-PE), or more heavily loaded
with greater conflicts (P-M). Through caching techniques, networks
supporting P-M systems are shown to perform equivalently to those in PE-to-PE
architectures.

To compare a circuit-switched network to a similar network implemented
using packet-switching, the internal and external environment must be the same in
both cases. Design choices such as data path width, interchange box
implementations, and PE-network interfacing techniques are internal factors that must
be considered. External factors that must be the same for comparative purposes
are: system size, processing speeds and network loading. Once these factors are
determined and set, a valid comparison of the switching methodologies can be made.

Davis and Siegel conclude that the circuit switched network provides better
performance for smaller-sized messages than does the packet-switched network. The
packet-switched network performs better for messages of longer lengths. These
results also show that the performance of the network is highly influenced by the
processing rates of the PEs.

3.6 Comparisons of Interconnection Networks

The ability to accurately compare interconnection networks which differ
topologically is essential in determining the suitability of the network to a particular
application. For this reason, recent research efforts have been directed toward these
types of comparisons. This section reviews the work of Dally [Dal86], Abraham
and Padmanabhan [AbP86], and Hsu, Yew, and Zhu [HsY87]. Two of these studies,
[Dal86] and [Szy86], perform comparisons of networks based on VLSI design constraints.
Analytical modeling and analysis of network performance are examined by [AbP86].
Comparison of the single stage cube (hypercube) network to a newly proposed
network is the topic of research of [HsY87].

3.6.1 VLSI Comparison of Low and Binary n-Cube Networks. W. J. Dally
[Dal86] presents a comparison of interconnection networks based on the wiring
requirements of VLSI circuits used to implement the networks. In his study, Dally
compares low-dimensional networks (e.g., mesh) to high-dimensional networks (e.g.,
binary n-cubes), with each having the same bisectional width. The bisectional width
of a network [Tho80] is the minimum number of wires that must be cut if the
network is to be divided into two equal halves. This comparison is further based on three
performance parameters: latency, average case throughput, and hot-spot throughput.
Latency is the time interval between successive initiations. The average case
throughput is defined as the average number of messages processed by the network in a unit
of time. A measure of the throughput between a pair of processing elements which
receive a disproportionately large amount of the network traffic is called the hot-spot
throughput.

One of the main operating assumptions used in Dally's study is the use of
wormhole routing. Wormhole routing, recall from Chapter 2, is a variation of virtual
cut-through routing [KeK79]. These two methods differ in the manner in which a
blocked message is handled. While virtual cut-through removes a blocked message
from the network, wormhole routing retains a blocked message in the network. The
benefit of using wormhole routing over the store-and-forward routing method is
reduced network latency. This is shown mathematically below in Equations 3.26 and
3.27 and graphically in Figure 3.4.

The derivation of the network latency using the store-and-forward and wormhole
routing techniques is dependent upon two components of latency, the distance
(D) and the message aspect ratio (L/W). The distance, D, is defined as the mean
point-to-point distance (in hops) from source to destination. The message aspect
ratio is the ratio of the message length, L, over the normalized channel width, W,
and can be described as the number of channel cycles necessary to transmit the
message across one channel. Using the store-and-forward method, an entire message
must be received by an intermediate node prior to the message being transmitted to
the next node in the communication link. The latency of the network must then be
the product of the distance through the network and the message aspect ratio.

$$T_{\text{store-and-forward}} = T_{\text{channel}} \left( D \times \frac{L}{W} \right) \qquad (3.26)$$

Using wormhole routing, partial messages may be transmitted upon receipt of
the control bits (flits) to the next node in the communication link. The latency of
the network can now be represented by the sum of D and L/W.

$$T_{\text{wormhole}} = T_{\text{channel}} \left( D + \frac{L}{W} \right) \qquad (3.27)$$

In Equations 3.26 and 3.27, $T_{\text{channel}}$ is the channel cycle time, the time
needed to complete a transaction on a channel. Figure 3.4 shows that the latency time
required for routing a message through three processing nodes is greatly reduced
when wormhole routing is used instead of store-and-forward techniques.
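
As a numerical illustration of Equations 3.26 and 3.27, the following Python
sketch evaluates both latency expressions. The function names and the sample
values (three hops, a 200-bit message, a normalized channel width of 8) are
assumptions chosen for illustration and are not taken from [Dal86].

```python
def store_and_forward_latency(t_channel, distance, length, width):
    """Equation 3.26: every one of the D hops waits for the whole message (L/W cycles)."""
    return t_channel * (distance * (length / width))


def wormhole_latency(t_channel, distance, length, width):
    """Equation 3.27: the header crosses D hops and the body streams behind it."""
    return t_channel * (distance + length / width)


if __name__ == "__main__":
    # Hypothetical sample values, not figures from the thesis.
    t_c, d, l, w = 1.0, 3, 200, 8
    print(store_and_forward_latency(t_c, d, l, w))  # 75.0 channel cycles
    print(wormhole_latency(t_c, d, l, w))           # 28.0 channel cycles
```

With these sample values the message aspect ratio is 25 cycles. Store-and-forward
pays it once per hop, while wormhole routing pays it only once, which is the
reduction Figure 3.4 depicts.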

Noting that for two processors picked at random from a k-ary n-cube, the
average number of channels that must be traversed, D, is given by

$$D = \left( \frac{k-1}{2} \right) n \qquad (3.28)$$


Also, the normalized channel width, W, is a function of the dimensionality, n, and
the radix, k, of the network (Equation 3.29). Substituting Equations 3.28 and 3.29
into Equation 3.27 gives the network latency for a k-ary n-cube (Equation 3.30).


Equation 3.30 shows that the latency of the network is dominated by the
dimension, n, of the network. This equation also assumes a constant wire length
among the networks. Further derivations of the network latency include the effects
due to variable wire lengths for linear and logarithmic delays associated with the
lengths of the interconnection wiring. The presentation of these derivations is
beyond the scope of this review. The reader should refer to [Dal86] for further
information about these derivations. For each of the derivations for the network
latency, the implication is that low-dimensional networks will have lower latencies
than the higher dimensional networks.
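
The trend can be checked numerically with a short Python sketch. It takes the
average distance as D = n(k-1)/2, the reading of Equation 3.28 adopted here, and
ASSUMES a normalized channel width of W = k/2, a normalization often used when
bisection width is held constant. That value of W is an assumption of this sketch
and is not confirmed by the text above. The function names are hypothetical.

```python
def average_distance(k, n):
    """Mean hop count between two random nodes of a k-ary n-cube (Equation 3.28)."""
    return (k - 1) / 2 * n


def kary_ncube_latency(k, n, length, t_channel=1.0):
    """Wormhole latency (Equation 3.27) with the ASSUMED channel width W = k/2."""
    width = k / 2  # assumption of this sketch, not a value stated in the text
    return t_channel * (average_distance(k, n) + length / width)


if __name__ == "__main__":
    # Three ways of building a 256-node network, with L = 200 bits.
    for k, n in [(16, 2), (4, 4), (2, 8)]:
        assert k ** n == 256
        print(f"k={k:2d} n={n}  D={average_distance(k, n):4.1f}  "
              f"T={kary_ncube_latency(k, n, 200):6.1f} cycles")
```

Under these assumptions the 2-, 4-, and 8-dimensional networks give 40, 106, and
204 cycles respectively, before any blocking is considered, so the
lowest-dimensional network has the lowest latency.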

The second performance measure examined by Dally is the throughput of the
network. The approach taken to estimate the throughput is to calculate the capacity
of the network, the maximum number of messages that can be in the network at any
given instant of time.

The maximum throughput as a fraction of the capacity for k-ary n-cubes can
be derived using the following assumptions and Equations 3.31 and 3.32. The results
of this derivation are presented in Table 3.1.
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1.  Each  processing  node  supplies  to  the  network  a  traffic  load  of  A-^^. 


2.  The  message  rate  on  channels  entering  the  dimension  is  i  g”^f ”  • 

3.  A  message  traverses  channels  per  dimension:  one  entering  channel  and 


(7  —  continuing  channels.  This  gives  a  channel  rate  continuing  in  a 

dimension  of  A^  =  aX£. 

4.  The  service  time  in  dimension  !  +  1  is 

5.  riie  service  time  of  the  last  continuing  channel  is  the  dimension  is  = 

v..,. 

ti.  The  probabilit}  of  collision  is  \eT,q  and  the  expected  waiting  time  to  resolve 
the  collision  is 

Using these assumptions, the service rate for the entering channel of the dimension
is given by:


l\o  = 


1  —  \/l  —  2\lx+\ 


A 


c 


(3.31 


Equation  3.31  is  only  valid  when  A^  <  The  service  rate  of  the  channel  of 

the  dimension  is  given  by: 


T, 


+1 


1  - 


1  + 


AcTn- 


(3.32j 


Setting the source service time, $T_0$, to the reciprocal of the message rate,
$\lambda_E$, and solving Equations 3.31 and 3.32 for $\lambda_E$ yields the maximum
throughput of the network.

Table 3.1 shows the maximum throughput for k-ary n-cubes which support
256 and 1024 processing nodes. These calculations of total latency are for message
lengths, L = 200 bits. Table 3.1 shows that the blocking effects due to
dimensionality are reduced as the dimension of the network is reduced.

Parameter                    256 Nodes                1024 Nodes
Dimension                2       4       8        2       5       10
Radix                    16      4       2        32      4       2
Maximum throughput       0.40    0.49    0.21     0.36    0.42    0.18
Blocking latency         43.9    121.    321.     45.3    128.    377.
                         51.2    145.    648.     50.0    162.    NA
                         64.3    180.    NA       59.0    221.    NA

Table 3.1. Maximum throughput as a fraction of capacity and blocking latency in
cycles [Dal86].

The third performance figure examined is the hot-spot throughput. The hot-spot
throughput for a k-ary n-cube which uses deterministic routing is $\Theta_{HS}$ and
equates to the bandwidth, W, of a single channel. Dally uses an assumption of
constant wire cost to represent $\Theta_{HS}$ by:

$$\Theta_{HS} = W = k - 1 \qquad (3.33)$$

Since low-dimensional networks have greater channel bandwidth than do
high-dimensional networks, the hot-spot throughput will also be greater in the
low-dimensional networks.

In this study, Dally compares low-dimensional networks (tori) to high-dimensional
networks (k-ary n-cubes) using the assumptions outlined above. Not included in his
investigation are the indirect or multistage k-ary n-cube networks. This exclusion
was due to his rationale that multistage network performance is similar to that of
the direct k-ary n-cube networks of high dimensionality.


3.6.2  Performance of the Direct Binary n-Cube Network  Abraham and
Padmanabhan, in [AbP86], present a mathematical analysis of the direct binary n-cube.
Their investigation considers two performance measures: the probability of message
acceptance and the bandwidth of the network. These performance measures, once
derived, prove to be better than similar measures for the indirect binary n-cube
produced by Dias and Jump [DiJ81], Patel [Pat81], and Kruskal and Snir [KrS83]. For
comparison purposes, the crossbar switch sizes are limited to 8-by-8.

The  analysis  of  the  direct  binary  n-cube's  performance  is  performed  twice.  The 
first  analysis  assumes  a  single-accepting  PE  scheme,  where  only  one  message  can  be 
accepted  by  the  PE  in  a  single  cycle.  The  second  approach  assumes  a  multiple- 
accepting  PE  scheme,  where  up  to  d  messages  can  be  received  by  the  PE  in  a  single 
cycle  for  a  d-dimensional  network.  Also  considered  are  the  cases  where  messages  will 
and  will  not  be  buffered  at  the  switches. 

Simplifying assumptions and definitions used in this research are:

1.  All nodes are identical, with the traffic between nodes being equally distributed.

2.  The  message  generation  rate  at  each  PE  is  m^. 

3.  The rate at which messages arrive from a neighboring node is m.

4.  The rate of messages entering a PE is $m_a$.

5.  $P_t$ is the probability that a message received from a neighboring node is for
the PE.

b.  is  the  probability  of  message  acceptance. 

7.  $P_a$ is the probability that a message received from a node is successfully
transmitted to another node.

8.  d is the dimension of the network.

9.  $N = 2^d$ is the number of nodes in the network.

                    Direct Binary n-cube     Indirect Binary n-cube
Network Size      256     1024    4096     256           1024          4096
Switch Size       9       11      13       2      4      2      4      2      4      8
Switches          256     1k      4k       1k     256    5k     1280   24k    6k     2k
Lines             2304    11k     52k      2304   1280   11k    6k     52k    28k    20k
Crosspoints       20736   121k    676k     4k     4k     20k    20k    96k    96k    128k

Table 3.2. Hardware requirement comparisons [AbP86].


In each of the analyses, mathematical relationships are derived and used for
comparison to the results obtained through simulations of the networks. Refer to
[AbP86] for an in-depth presentation of the mathematical derivations. Using the
above results, and through simulations, the direct binary n-cube is found to give
higher performance than the indirect binary n-cube when the size of the crossbar in
the indirect network is less than 8-by-8. Using an 8-by-8 crossbar switch reveals
that the direct and indirect binary n-cubes have equal performance. The performance
advantages of the direct cube for small sized crossbars do not come about without
cost penalties. Table 3.2 shows various measures of costs for the two types of
networks.
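
The entries of Table 3.2 appear to follow from simple counting rules, which the
Python sketch below reproduces. The rules are inferred here from the tabulated
figures and are not stated in [AbP86] as quoted above: a direct cube of N = 2^d
nodes is taken to use one (d+1)-by-(d+1) crossbar per node, and an indirect cube
built from k-by-k crossbars is taken to have log_k N stages of N/k switches each.
All function names are hypothetical.

```python
def int_log(value, base):
    """Exact integer logarithm; raises if value is not a power of base."""
    exponent, power = 0, 1
    while power < value:
        power *= base
        exponent += 1
    if power != value:
        raise ValueError(f"{value} is not a power of {base}")
    return exponent


def direct_cube_cost(num_nodes):
    """One (d+1)-port crossbar per node: d cube links plus the local PE port."""
    ports = int_log(num_nodes, 2) + 1
    return {"switches": num_nodes,
            "lines": num_nodes * ports,
            "crosspoints": num_nodes * ports ** 2}


def indirect_cube_cost(num_nodes, k):
    """log_k(N) stages of N/k crossbars, each k-by-k."""
    stages = int_log(num_nodes, k)
    switches = (num_nodes // k) * stages
    return {"switches": switches,
            "lines": num_nodes * (stages + 1),
            "crosspoints": switches * k ** 2}


if __name__ == "__main__":
    print(direct_cube_cost(256))       # 256 switches, 2304 lines, 20736 crosspoints
    print(indirect_cube_cost(256, 2))  # 1024 switches, 2304 lines, 4096 crosspoints
```

Under these rules the direct cube's crosspoint count grows as N(d+1)^2, which is
the cost penalty the table makes visible at 4096 nodes.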

3.6.3  An Enhanced Hypercube Network  The work of Hsu, Yew, and Zhu
[HsY87] focuses on the introduction of a new interconnection network structure, the
Block-shuffled Hypercube (BSH). It is proposed that the BSH implementation may
be able to replace the hypercube network with minimum hardware changes. The
interconnection scheme of the BSH is based on the shuffle-exchange interconnection
method of [Law75]. As an enhancement to the hypercube interconnection network,
[HsY87] show that the hardware requirements of the BSH network are substantially
less than those required by the hypercube. Using a message generation approach
identical to [AbP86], it is shown that the BSH network outperforms the hypercube
interconnection network in terms of message delays and hardware costs. Based on
the reasoning of [AbP86], the authors further conclude that the BSH network also
outperforms the multistage cube networks of similar network size.

3.7  Summary

In this chapter, the state of performance modeling and analysis of interconnection
networks was examined. The use of computer simulations to model interconnection
networks provides a low cost and timely method for determining the feasibility
of the system(s) in question. VLSI comparisons of interconnection networks prove
invaluable as the technology is rapidly progressing to meet the needs of systems
which require processing units on the order of tens of thousands.

The reviews of the performance modeling and analysis of interconnection networks
presented above reveal that much is yet to be learned from these types of
analysis. First, the present studies are limited in the scope of their comparisons.
Normally, one type of network is compared against the next under simplifying and
constraining assumptions. There need to be comparisons of classes of networks in
a broader sense. The works by [Dal86] and [AbP86] present examples of the limited
nature of performance comparisons. While Dally contends that low-dimensional
networks perform better than do high-dimensional networks such as the multistage
cubes, this contention may not be totally correct. One of the main points presented
by [DaS86] was the packet offset time, or time differential between packet generation
and packet presentation to the network. This is an important point to consider. If
the processing elements take orders of magnitude longer to present the packet to
the network than the packet will take to traverse the network, the bottleneck has
now shifted from the network to the processing elements. If this is the case,
performance comparisons such as the ones presented above must be reexamined. In
Dally's work, this time differential is not addressed. Additionally, the work of
Abraham and Padmanabhan is limited in its scope. In their comparison of the direct
binary n-cube with indirect binary n-cubes, the size of the switching elements was
limited to 8-by-8. While this may have been a sufficient limiting size to validate
their research for small size crossbar switches, work by [Szy86] has shown that with
present VLSI techniques, crossbar sizes of 32-by-32 are now realizable. Further work
is necessary to determine what effects the increased crossbar switch sizes will have
on the networks examined.


The information contained in this chapter is used as a base for further research
in the parallel processing systems environment. Specifically, the following chapters
present the methods used and results obtained in the comparison of three
interconnection networks whose sizes range from 64 to 1024 processing elements. The
interconnection networks to be modeled for evaluation are the single stage cube, the
multistage cube and the Illiac IV networks.

4.  Interconnection Network Modeling


4.1  Introduction

This chapter presents the methodology used in the modeling and simulation of
the multistage cube networks, the single-stage cube networks, and the Illiac IV mesh
network. A discussion of network modeling via computer simulations is presented
in Section 4.2. In this section, an introduction to the SLAM II simulation language
is presented along with the benefits it lends to the modeler. This discussion is
followed by Section 4.3, which defines the network internal and external environments
and the operating assumptions which have been used to facilitate this investigation.
The formulation of the network models is outlined in Section 4.4 with an in-depth
discussion of the channel allocation-deallocation problem. The approaches used in
designing and simulating the multistage cube networks, the single-stage cube
networks, and the Illiac IV mesh networks are explained in Sections 4.5, 4.6, and 4.7,
respectively.

4.2  SLAM II and Interconnection Network Modeling

In recent years, network modeling via computer simulation has been implemented
primarily in high-order languages such as C and Fortran. These simulations,
such as [Ove82], have been comprised of thousands of lines of source code, which
adds to the complexity of the modeling effort and the overall time required to
complete the effort. To reduce, and possibly eliminate, the necessity of high-order
language simulations, Pritsker and Associates, Inc. [Pri86] developed a
Fortran-based language, Simulation Language for Alternative Modeling (SLAM), which
is ideally suited for network simulations. The compactness and completeness of its
code makes SLAM modeling desirable. For this reason, and due to the expertise
available for designer's questions, SLAM was chosen as the tool for use in modeling
the interconnection networks of this investigation. A brief description of the SLAM
network language is provided to assist in the understanding of the models to be
presented in the latter sections of this chapter.

In the modeling of a network, the SLAM simulation consists of two parts: the
control statements and the network description. The control statements provide
options to the user for determining the initial states, any modifications to the
simulation, and when and how to terminate the simulation. The network description is
the SLAM code which represents the modeler's interpretation of the actual network
process. SLAM provides to the user a set of 23 network statements which allow for
in-depth simulations ranging from complex computer networks [Gar85] to computer
interconnection networks [AlH86].

SLAM provides the entity, which is used to model a message that will flow
through the network in a store-and-forward manner. Each entity can have associated
with it a set of attributes which are used to distinguish one entity from the other.
These attributes can be assigned values which represent source addresses, destination
addresses, message lengths, or other user-defined values. The file is used to represent
resources such as channels or memory modules, as well as queues which store groups
of entities. The basic concept of the SLAM model is to have the entity(s) generated
at some prescribed interarrival rate and then flow through the network, following
the routes determined by the designer. Each entity which enters the network will
also be terminated and removed from the network. Upon termination, statistics
associated with the entity may be collected when specified.

The statistical results of the simulation are provided for in SLAM via a summary
report. The summary report includes statistics on the files, activities, and/or
variables of the model. The summary report is the primary output of a SLAM
simulation. Additional information concerning the simulation can be obtained from echo
reports and trace reports. Echo reports reflect the data input and the initial values
set prior to execution. The trace reports are primarily used as debugging and
validation tools for the model. The trace report provides a snapshot view of the
network at each instant of time in which an event is scheduled to occur.

4.3  Network  Operating  Assumptions 

To facilitate this investigation of the mesh, the single stage cube and the
multistage cube interconnection networks, certain assumptions and operating conditions
for the networks must be established. Based on the operating assumptions used in
previous research [DiJ81], and from the discussions in Chapters 2 and 3, the
operating assumptions and conditions for the network simulations are described below.

1.  Each of the networks to be modeled is assumed to be operating in a MIMD
environment.

2.  A PE-to-PE architecture is assumed.

3.  Packet switching is used as the method for inter-PE communications. Message
buffers are employed at the switches for the storing and forwarding of packets
to and from the switches.

4.  Message arrivals are assumed to form a Poisson process, so that message
interarrival times are exponentially distributed.

5.  Generation of source and destination PE addresses is uniformly distributed
over the range of values specified by the number of PEs in the network.

6.  Messages are assumed to be single packets in length.

7.  The unit of measure for determining the average message delay and throughput
of the network is the packet cycle time. This is the time that it takes a packet
to move from the front of a queue, through its corresponding switch, and arrive
at the queue associated with the next point (switch) in its routing scheme. For
all simulations, the packet cycle time is normalized to 1 unit.

8.  Network outputs can process messages faster than the messages can be
generated. This ensures that the output device will not be a bottleneck.

9.  Message buffers are infinite in length. The rate at which packets arrive at the
input queues is controlled by a Poisson process. This arrival rate determines
the relative load on the network along with the average and maximum size of
the crossbar switch queues necessary to store the arriving packets for a given
rate. Packets entering these buffers are transmitted on a first-come-first-served
basis (FIFO).
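
Assumptions 4, 7, and 9 together describe each buffer as a FIFO queue with
Poisson arrivals, an infinite buffer, and a fixed service time of one packet cycle.
The Python sketch below simulates one such queue in isolation using the Lindley
recursion. It is not the SLAM model of this investigation, and the function name,
sample count, and seed are assumptions made for illustration.

```python
import random


def mean_queueing_delay(load, num_packets=200_000, seed=1):
    """Average wait, in packet cycles, at one FIFO queue with an infinite buffer."""
    rng = random.Random(seed)
    service = 1.0   # packet cycle time, normalized to 1 unit (assumption 7)
    wait = 0.0      # time the current packet spends waiting in the queue
    total = 0.0
    for _ in range(num_packets):
        total += wait
        gap = rng.expovariate(load)  # exponential interarrival time, rate = load
        # Lindley recursion: the next packet waits for whatever work is left over.
        wait = max(0.0, wait + service - gap)
    return total / num_packets


if __name__ == "__main__":
    for load in (0.2, 0.5, 0.8):
        print(f"load {load:.1f}: mean wait {mean_queueing_delay(load):.2f} cycles")
```

For a load of 0.5 the estimate should fall close to 0.5 packet cycles, the value
given by the standard M/D/1 waiting-time formula rho/(2(1 - rho)), and it grows
rapidly as the load approaches 1.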

4.4  Formulation of Network Models

The focus of this investigation is to determine how two network performance
parameters, message delay and network memory costs, compare for dissimilar
interconnection networks. Specifically, by comparing these parameters for the mesh,
single stage cube and multistage cube interconnection networks, the desirability of
one network over another can be determined for an arbitrary network size at various
loads. Recalling from Chapters 2 and 3, the switching element most commonly used
for connecting autonomous processors is the crossbar switch. Since current
technology allows for implementations of crossbar switches to be as large as 32-by-32
[Szy86], modeling an interconnection network using crossbar switches appears
feasible and realistic. Using the crossbar switch as a focal point, the networks to be
investigated have a common starting point in the modeling effort.

4.4.1  The Allocation Problem  As aforementioned, the tool used in this
investigation is SLAM II [Pri86]. This versatile language allows the user to design
models with the intricacy and diversity limited mainly by the designer's abilities. To
support the basic language, SLAM allows the designer to write code in Fortran to
perform tasks that may not be normally supportable.

The initial approach to modeling the three interconnection networks is to
examine a relatively small network with the smallest crossbar switch available. A
64-PE multistage cube network which uses 2-by-2 crossbar switches is chosen as the
starting point of the modeling effort. The first obstacle to overcome is the
instantiation of code for modeling a set of parallel and autonomous processes. Two
methods existed: for an N-PE system, instantiate the create process for messages N
times, which would cause a massive duplication of code; or use an interarrival rate
which would allow the N processes to be simulated using one message creation node.
The latter technique is chosen. By allowing the interarrival rate to be a Poisson
process, mathematically the aggregate of the interarrival rate can be decomposed to
provide the rate necessary to model the N parallel processes.
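
The property relied upon here is that N independent Poisson streams of rate
lambda merge into a single Poisson stream of rate N times lambda, and that
assigning each arrival of the merged stream to one of the N sources with equal
probability recovers the individual streams. The Python sketch below illustrates
this with a single creation loop. It is an illustration only, not the SLAM create
node, and the function name and parameters are assumptions.

```python
import random


def aggregate_arrivals(num_pes, rate_per_pe, horizon, seed=1):
    """Return (time, source_pe) pairs drawn from ONE stream of rate N * lambda."""
    rng = random.Random(seed)
    total_rate = num_pes * rate_per_pe
    events = []
    clock = rng.expovariate(total_rate)
    while clock < horizon:
        source = rng.randrange(num_pes)  # every PE is equally likely to be the sender
        events.append((clock, source))
        clock += rng.expovariate(total_rate)
    return events


if __name__ == "__main__":
    arrivals = aggregate_arrivals(num_pes=64, rate_per_pe=0.1, horizon=1000.0)
    per_pe = len(arrivals) / 64 / 1000.0
    print(f"{len(arrivals)} messages, about {per_pe:.3f} per PE per unit time")
```

The observed per-PE rate should be close to the requested 0.1, which is what
allows one creation node to stand in for N parallel message generators.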

Using this type of interarrival rate provides the possibility to model a network
of arbitrary size by modeling how one entity (message) flows through the network.
This further allows for a simulation such as [Ove82] to be reduced from thousands
of lines of code to less than one hundred. By resolving the parallel code instantiation
problem, this investigation proceeds by examining the next major obstacle, modeling
the entry and removal of entities (packets) from the crossbar switch input queues.
This process turns out to be one of the most time consuming and thought provoking
aspects of this investigation.

To appreciate the solution to the problem of entity entry and removal from a
switch queue, it is necessary to explain the trials and errors associated with obtaining
the solution. In the networks that are to be modeled, three important pieces of
information are absolutely necessary: where the packet was coming from; to which
queue the packet would be routed; and to which outgoing channel the packet would
be routed. Using mathematical relationships to be discussed in the sections related
to each particular network, the above three requirements are obtainable. Still to
be resolved is the allocation of an available channel and the reallocation of the same
channel when it is applicable.

Understanding the allocation problem encountered requires additional
information on how the SLAM processor works and a description of the SLAM AWAIT
node, RESOURCE block and FREE node. In SLAM, all events that occur associated
with an entity (packet), from creation time to final termination, are scheduled on
an event calendar. This event calendar is a doubly linked list with the capability
of being scanned forward and backward. When an entity is created, statistics are
kept on that entity as it flows through the network. Any waiting that the entity
encounters (i.e., queueing time) is also kept in the statistics associated with the
entity and the file corresponding to the queue the entity entered. The event calendar
is incremented only through the use of an activity which specifies the duration of a
particular event such as the packet cycle time.
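
The idea of an event calendar can be sketched in a few lines of Python. The
sketch substitutes a binary heap for the doubly linked list described above, so it
shows the concept only and not SLAM's internal data structure. The class and
method names are hypothetical.

```python
import heapq


class EventCalendar:
    """Time-ordered list of pending events; time advances only when one is taken."""

    def __init__(self):
        self.now = 0.0
        self._events = []
        self._sequence = 0  # tie-breaker so simultaneous events keep insertion order

    def schedule(self, delay, label):
        """Schedule an event `delay` time units after the current simulated time."""
        heapq.heappush(self._events, (self.now + delay, self._sequence, label))
        self._sequence += 1

    def run(self):
        """Process all events in time order and return the (time, label) history."""
        history = []
        while self._events:
            self.now, _, label = heapq.heappop(self._events)
            history.append((self.now, label))
        return history


if __name__ == "__main__":
    calendar = EventCalendar()
    calendar.schedule(2.0, "packet B crosses a channel")
    calendar.schedule(1.0, "packet A crosses a channel")  # one packet cycle
    for time, label in calendar.run():
        print(time, label)
```

As in the description above, simulated time moves forward only when an activity
with a stated duration places an event on the calendar.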

Critical to this investigation is the control of the communication channels
between crossbar switches. Physical implementations allow one packet to be using a
channel at any instant of time. Subsequent requests for the channel will be granted
only when the channel is free. Packets requesting a busy channel must be queued
or dropped from the system in accordance with the packet disposition algorithms
described in Chapter 3. Initially, the SLAM RESOURCE block appears to be the
best method for modeling the channels connecting the crossbar switches. A
RESOURCE block allows the designer to specify the capacity of the resource and which
files (queues), if any, should be polled when the resource is available. Used in
conjunction with a RESOURCE are the AWAIT and FREE nodes. As an entity flows
through the network, it can enter an AWAIT node requesting access to a particular
RESOURCE. If the RESOURCE is available, the entity seizes a specified number
of units of the requested RESOURCE (in the case of the networks to be simulated,
one unit of a particular resource was the capacity) and proceeds to a service activity
which increments the event calendar and the entity's time in the system. The service
activity specifies the duration of the event simulated. Upon completion of the service
activity, the entity flows into a FREE node which deallocates a specified number of
units of the seized RESOURCE. Once the RESOURCE has been deallocated, SLAM
automatically checks if any requests for the RESOURCE have not been filled. This is
done by keeping a list of what files (queues) wanted access to the RESOURCE when
it was busy. If requests are waiting, this list is used to determine which entity will
be allowed to allocate the RESOURCE next. SLAM then scans the files associated
with the resource declaration in the order specified by the declaration statement. As
an example, consider the following code.

    RESOURCE/1,CHAN1(1),113,121;

    AWAIT(QUE=65,128),CHAN/1;
    ACT/1,1;

    FREE,CHAN/1;

The above code declares a resource named CHAN1, which has a capacity of
one unit and a SLAM file number equal to 1. Two files (queues) associated with
CHAN1 are 113 and 121. This declaration tells SLAM that files 113 and 121 are
the files to examine, in that order, when CHAN1 is requested. Previous to an entity
flowing into the AWAIT node, values are assigned to QUE and CHAN to correspond
with the input queue of a crossbar switching element and the output channel from
the crossbar switch to the next input queue in the entity's routing path. Assume
that QUE = 113 and CHAN = 1 (indicating CHAN1 is requested). When an entity
flows into the AWAIT node, if CHAN1 is busy then the entity is stored in file 113. If
CHAN1 is free, then the entity seizes CHAN1, which locks out subsequent requests,
and then performs the following activity, ACT/1,1. This activity tells the SLAM
processor that the next event for this particular entity is to be scheduled one time
unit in the future. After one time unit has expired, the entity flows into the FREE
node where the deallocation and reallocation of CHAN1 is determined. If additional
requests for CHAN1 are present, file 113 is checked first and then file 121 for
requestors. As long as requestors exist in file 113, file 121 will never be checked. For
the purposes of this investigation, the sequential manner in which the SLAM
processor polls the files associated with a RESOURCE proves inadequate. This is due,
in large part, to the physical operation of the crossbar switch.
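
The consequence of sequential polling can be seen in a small Python sketch. It
mimics only the polling order described above, in which the first non-empty file in
the declared order is served. It is not SLAM code, and the function name and the
sample entities are hypothetical.

```python
from collections import deque


def free_sequential(files, poll_order):
    """Serve the head of the first non-empty file, scanning in declaration order."""
    for file_number in poll_order:
        if files[file_number]:
            return file_number, files[file_number].popleft()
    return None  # no requests are waiting for the resource


if __name__ == "__main__":
    # Each entity is (name, time it joined its file); file numbers follow the text.
    files = {
        113: deque([("p1", 5.0), ("p2", 6.0)]),
        121: deque([("q1", 1.0)]),  # the oldest request, but polled second
    }
    print(free_sequential(files, [113, 121]))  # (113, ('p1', 5.0))
```

Entity q1 has waited longest, yet it is served only after file 113 has been
emptied, which is the behavior the investigation found unsuitable for a crossbar
switch.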

In the physical operation of the crossbar switch, one method of determining
channel allocation is the longest-waiting-first scheme. In each of the FIFO input
queues, the first packet is polled to determine if the packet desires the free channel.
The oldest packet of the ones waiting for a particular channel is then chosen for
removal from its queue and allowed to traverse the channel, blocking others waiting
until its traversal is complete. This is the desired approach for implementation in the
ALLOC subroutine. In consultation with Pritsker and Associates technical consultants
[Pri87], assurance was given that this type of implementation was possible, but that
at present no documentation existed of attempts to perform an allocation scheme
this complex using the ALLOC routine and RESOURCEs.
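The longest-waiting-first scheme described above can be sketched outside of SLAM. The following Python fragment is only an illustration under stated assumptions: the names pick_longest_waiting, "arrival", and "chan" are hypothetical and do not come from the thesis model, and each input queue is represented as a FIFO of packets that record their arrival time and requested channel.

```python
from collections import deque


def pick_longest_waiting(queues, channel):
    """Return the index of the input queue whose head packet has waited
    longest for `channel`, or None if no head packet requests it."""
    best_index = None
    best_arrival = None
    for index, queue in enumerate(queues):
        if not queue:
            continue
        head = queue[0]  # only the first packet of each FIFO queue is polled
        if head["chan"] != channel:
            continue
        if best_arrival is None or head["arrival"] < best_arrival:
            best_index, best_arrival = index, head["arrival"]
    return best_index


# Three input queues at one crossbar switch (hypothetical data).
queues = [
    deque([{"arrival": 4.0, "chan": 1}]),
    deque([{"arrival": 2.5, "chan": 1}, {"arrival": 3.0, "chan": 1}]),
    deque([{"arrival": 0.5, "chan": 2}]),
]
winner = pick_longest_waiting(queues, channel=1)
packet = queues[winner].popleft()  # the oldest requester seizes the channel
```

Note that the packet in the third queue is older than both candidates but is ignored because it requests a different channel; this is the behavior that sequential file polling cannot reproduce.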

At each crossbar switch, the number of input queues to the switch depends
upon the number of PEs in the network and the type of interconnection
network modeled. To model the physical operation of allocating a channel from one
of many input queues, an alternative to the sequential method of file polling is sought.
The SLAM processor provides the user the possibility of user-defined allocation
schemes via the ALLOC subroutine. This Fortran subroutine allows for complex
allocation schemes which are normally not supported by the SLAM processor.

Pursuing the channel allocation scheme using the ALLOC subroutine requires
that this routine be called each time one of the following two cases occurs. First,
ALLOC is called when an entity arrives at an AWAIT node and, second, when an
entity arrives at a FREE node. This adds extra burdens on the designer to ensure
that the correct logic is executed when ALLOC is called. The implementation of the
logic associated with the call to ALLOC at an AWAIT node is relatively
straightforward. If the channel is free and the queue with which the arriving entity is
associated is empty, let the entity seize the RESOURCE and pass through the AWAIT
node. If the RESOURCE is busy or the input queue is not empty, then file the
incoming entity in the appropriate queue. The complex portion of this allocation
scheme occurs at the FREE node.

At the FREE node, ALLOC is called to determine if an allocation can be
made. For this portion of ALLOC, the oldest entity waiting for the channel to be
freed must be determined. While the designer is allowed to manipulate
the SLAM files to extract the oldest waiting entity, the unsolvable part of this
allocation comes from the inherent limitations of SLAM and the sequential polling of
the files associated with the RESOURCE declaration. For an entity to flow through
the network, the entity's attributes must reside in the ATRIB buffer. This buffer
contains the attributes of the entity which is currently traversing the network. These
attributes remain in the ATRIB buffer until the entity's traversal is forced to stop
due to a QUEUE, AWAIT, or FREE node. At any of these nodes, the attributes of
the arriving entity are filed in the appropriate file, with the ATRIB buffer assuming
the attributes of the entity whose rank is one and which resides in the first nonempty
file associated with the corresponding RESOURCE block. The fatal error with
attempting to use ALLOC to solve the allocation problem is that if the oldest entity that
requests the channel is not the same entity as the one placed in the ATRIB buffer,
then one of two situations can occur. First, any attempt to place the oldest entity in
the ATRIB buffer results in the loss of the entity whose attributes previously resided
in the ATRIB buffer. Second, realizing that the oldest entity and the entity in the
ATRIB buffer do not match, the designer of ALLOC informs the SLAM processor
that no allocation can be made. This second situation causes the SLAM processor
to attempt another allocation by calling ALLOC a subsequent time. If the entities
do not match, the processor attempts to free the RESOURCE associated with the
entity in the ATRIB buffer, causing an error to occur in deallocating a RESOURCE
that had not been allocated. Having exhausted possible "tricks" to fool the SLAM
processor into allocating the oldest entity via ALLOC, and due in large part to the time
constraints of this investigation, an entirely different approach is considered
in attempting to solve the allocation-deallocation problem.

4.4.2 The Allocation Solution. Abandoning the attempt to model the networks
via the use of RESOURCEs, the method pursued, with the solution following,
is to discretely code the actions of the AWAIT and FREE nodes in an EVENT
node. The EVENT node is provided in SLAM to allow free access to file
manipulation, with the exclusion of the event calendar file. When an entity flows into an
EVENT node, its attributes remain in the ATRIB buffer throughout the entire
operation of the logic associated with the EVENT node. This aids in the solution of the
allocation-deallocation problem, since the switch input queues and outgoing channels
are attribute-based calculations which are carried by the entity as it flows through
the network. Used in conjunction with the EVENT node is the ENTER node, which
allows for selective entity entry into the network. The basic function of the EVENT
node is to determine the oldest waiting entity and allow for its placement into the
network by the ENTER node. The algorithm flowchart for the EVENT node is
shown in Figure 4.1.

In place of the RESOURCE block used to model the inter-switch box
communication channels are attribute-based ACTIVITIES. Each ACTIVITY is uniquely
defined by an integer number ranging from one to the number of channels in the
network. Much like the RESOURCE, an ACTIVITY can be accessed and held by only
one entity for the duration specified by the ACTIVITY. Unlike the RESOURCE,
the ACTIVITY has no automatic reallocation process; it is released at the end of
the time interval specified in the ACTIVITY statement and is free to be accessed by
any entity requesting it. For this investigation, using the ACTIVITY to model the
channel proves to be the ideal solution.

4.5 The Multistage Cube Network Model

The  multistage  cube  interconnection  network  was  chosen  as  the  first  network 
to  be  modeled  due  to  the  need  to  compare  and  validate  the  simulation  results  with 
previously  published  research  [DiJ81,  KrS83].  These  results  will  be  discussed  in 
detail  in  Chapter  5. 

Throughout the modeling of the three chosen interconnection networks,
mathematical relationships were sought for the appropriate network parameters. Achieving
these types of relationships allowed for the source code to be compact and easy to
follow logically. Along with the use of mathematical relationships, the use of Poisson
interarrival rates allowed the SLAM source code needed to simulate an arbitrarily
sized multistage cube network to be less than 50 lines. Figure 4.2 shows the SLAM
graphic representation of the multistage cube network.

Associated  with  each  generated  entity  (message)  is  a  set  of  nine  attributes 
which  distinguish  the  network  entities.  These  attributes  are  defined  as  follows: 

1. ATRIB(1) : the time of entry into the system

2. ATRIB(2) alias SRCE : the source address

3. ATRIB(3) alias DEST : the destination address

4. ATRIB(4) alias STAGE : the present stage

5. ATRIB(5) alias QUE : the input queue of a crossbar switch

6. ATRIB(6) alias CHAN : the outgoing channel number

7. ATRIB(7) alias BITS : the indicator bits that are passed to the EVENT node
to determine the queues that should be scanned at a particular crossbar switch

8. ATRIB(8) alias INTIME : the time an entity enters the EVENT node

9. ATRIB(9) alias DUMMY : dummy flag attribute for entering the EVENT node

Figure 4.1. EVENT node flowchart for allocation-deallocation process

Also associated with the network simulation are three global variables. These
variables are used throughout the simulation, with their values remaining static.

1. XX(1) : the number of input queues at a crossbar switch

2. XX(2) : the number of routing tag bits that must be examined for each pass
in USERF(1)

3. XX(3) : number of nodes in the network

Each simulation begins with the creation of a packet at time zero and
thereafter at a rate specified by the Poisson interarrival times. Upon creation, a packet is
assigned a source-destination address pair based on a uniform distribution whose
range of values is restricted by the number of nodes in the network. Next, the packet's
appropriate attributes are assigned a value for the present stage and the initial input
queue. The packet continues to flow into a sequence of statements which assign and
reassign values to the outgoing channel, the present stage, and the input queue to
the next stage. This sequence constitutes a loop which steps the packet through
the network toward its destination. The packet exits the loop when it reaches its
destination. At this point, statistics are gathered on the packet's time in the system.

Within the loop is the packet flow control, the EVENT node. The EVENT
node controls the number of stages that a packet is allowed to flow through
unrestricted before it must be stopped and filed in the appropriate queue.

4.5.1 Packet Routing in the Multistage Cube Model. As described in Chapter
2, the packet routing in a multistage cube interconnection network is deterministic.
For each stage in the network, address lines are grouped at switch boxes according
to the Cube_i function. For example, at stage 3 of a 16-node network, address lines
0 and 8 are grouped together in a 2-by-2 crossbar switch. The routing to the next
stage is dependent upon the source-destination pair and the routing algorithm used
(XOR or destination routing).
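The grouping example above follows directly from the definition of the Cube_i function, which complements bit i of an address. A minimal Python sketch, assuming 2-by-2 switches and stages numbered from zero (the function names cube and switch_groups are hypothetical and are not part of the SLAM model):

```python
def cube(i, address):
    """Cube_i interconnection function: complement bit i of the address."""
    return address ^ (1 << i)


def switch_groups(stage, n_nodes):
    """Pairs of address lines that share a 2-by-2 switch at the given stage."""
    return [(a, cube(stage, a)) for a in range(n_nodes) if not a & (1 << stage)]


pairs = switch_groups(3, 16)  # stage 3 of a 16-node network
```

Here the first pair produced is (0, 8), matching the example in the text, and the 16 lines fall into eight switch boxes at that stage.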

Figure 4.2. SLAM graphic representation of a multistage cube network

Of particular importance in determining the outgoing channel are the current
stage number, the bits to be examined in the address, the size of the crossbar switch
(i.e., 4 for a 4-by-4), and the source-destination address pair. Implementing the
routing algorithm in SLAM requires that the algorithm be discretely coded in Fortran.
SLAM supplies the user a function, USERF, which allows for decision modeling
which is not supported by the basic constructs. Using USERF and ATRIB(5), the
equation for the outgoing channel is given by:

CHAN = QUE + USERF(1)    (4.1)

USERF(1) determines the relative position of the outgoing channel based on
the relative position of the input queue where the packet is currently residing and the
destination address. These relative positions are determined by examining particular
bits of the source and of the destination addresses. The range of values for the relative
positions is limited by the size of the crossbar switch. Once the relative positions
have been determined, USERF returns a value to CHAN based on the following
equation:

USERF = (destination_{rp} - source_{rp}) \times size^{stage}    (4.2)

where

destination_{rp} is the relative position of the destination address,
source_{rp} is the relative position of the source address,
size is the crossbar switch size, and
stage is the current stage number.

Once the outgoing channel is determined, the calculation of the input queue
at the next stage can be made. Using an internal numbering relationship, the input
queue number at the next stage is equal to the outgoing channel number from the
previous stage. By numbering in this manner, only one calculation has to be made
for any channel-queue pair.
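The channel calculation of Equations 4.1 and 4.2 can be illustrated with a short Python sketch. Several assumptions are made here that the thesis does not state explicitly: stages are numbered from zero, the exponent in Equation 4.2 is the stage number, queues and channels are identified by their line number within a stage, and the relative position is taken to be the base-size digit of the line number at the current stage. The function names are hypothetical stand-ins for the Fortran USERF(1) logic.

```python
def relative_position(line, size, stage):
    """Digit of `line` in base `size` at position `stage` (assumed meaning
    of 'relative position' at a size-by-size crossbar switch)."""
    return (line // size ** stage) % size


def userf1(que, dest, size, stage):
    """Offset from the current input queue to the outgoing channel."""
    return (relative_position(dest, size, stage)
            - relative_position(que, size, stage)) * size ** stage


def outgoing_channel(que, dest, size, stage):
    """CHAN = QUE + USERF(1)."""
    return que + userf1(que, dest, size, stage)


# Step a packet from line 5 to destination 10 through a 16-node network
# built from 2-by-2 switches; the channel of one stage is the queue of the next.
line = 5
for stage in range(4):
    line = outgoing_channel(line, 10, 2, stage)
```

Under these assumptions the packet ends on line 10, its destination, after the last stage, which is the behavior expected of destination routing.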


4.6 The Single Stage Cube Network Model

Using an approach similar to the one used in modeling the multistage cube
interconnection network, the simulation code for the single stage cube interconnection
network of arbitrary size is compact and modular. The SLAM graphic representation
is shown in Figure 4.3.

As in the multistage cube interconnection network, the single stage cube model
has nine attributes associated with each entity (packet) that is placed in the network.
These attributes are:

1. ATRIB(1) : the time of entry into the system

2. ATRIB(2) alias SRCE : the source address

3. ATRIB(3) alias DEST : the destination address

4. ATRIB(4) alias BOX : crossbar switch number

5. ATRIB(5) alias PSRCE : previous source number

6. ATRIB(6) alias CHAN : outgoing channel number

7. ATRIB(7) alias QUE : the input queue number

8. ATRIB(8) alias INTIME : time entity enters EVENT

9. ATRIB(9) alias DUMMY : dummy flag for EVENT node

Also, four global variables are needed for the simulations. These variables
remain static throughout the simulations and are defined as follows:

1. XX(1) : the network size

2. XX(2) : the size of the crossbar switch

3. XX(3) : the number of channels in the network

4. XX(4) : bit counter for USERF(2) and USERF(4)

The creation of packets into the network is controlled by a Poisson process.
Once created, packets are assigned source-destination addresses based on a uniform
distribution over the range of the number of processors in the network. The
assignment of the initial input queue follows the source-destination assignments. Following
these initial assignments, the packet enters a section of code which steps the packet
through the network toward its destination node. In this section, the outgoing
channel, the next intermediate source, and the next input queue are determined. How
far an entity proceeds through the network is determined in the EVENT node
described previously in Section 4.4. When an entity reaches its destination, statistics
are collected on the total time the entity spent in the network. The entity is then
terminated and removed from the system.

4.6.1 Packet Routing in the Single Stage Cube Model. Recall from Chapter
2 that in an m-dimension single stage cube interconnection network, each processor
is connected to m other processors by the Cube_i interconnection function. This
requires that the crossbar switch at each node be of size m + 1. Once a packet is
placed at a node, a routing algorithm must be employed to move the packet to the
next node in the routing path. The Intel iPSC uses a "hardwired" method [Int86]
of routing using one of seven local area network (LAN) channels to move the packet
from one node to the next. A particular output channel is dependent upon the
current node and the next node to which the packet is to be routed.

Mathematically, the routing algorithm of the iPSC performs an exclusive-or of
the source-destination addresses and then scans the resultant bits until it encounters
a logic 1. The LAN channel to be used is the bit position of the first 1 encountered
when scanning from least significant bit to most significant bit. Transmitting the
packet along the chosen LAN channel places the packet at a new source node, where
the above process is repeated until the new source is equal to the final destination
address.
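The exclusive-or routing procedure just described can be sketched in a few lines of Python. This is an illustration only and not the Fortran used in the model: the function names are hypothetical, and the next node is assumed to be obtained by complementing the selected bit of the current node address, which is the effect of traversing a hypercube link.

```python
def first_one_bit(value):
    """Position of the least significant 1 bit, or -1 if value is zero."""
    if value == 0:
        return -1
    return (value & -value).bit_length() - 1


def route(source, destination):
    """Return the list of (channel, next_node) hops from source to destination."""
    hops = []
    node = source
    while node != destination:
        channel = first_one_bit(node ^ destination)  # scan from the LSB
        node ^= 1 << channel  # cross the link for that dimension
        hops.append((channel, node))
    return hops


path = route(5, 10)  # source 0101, destination 1010
```

For this source-destination pair all four address bits differ, so the packet uses channels 0, 1, 2, and 3 in turn and visits nodes 4, 6, 2, and 10. The value of -1 returned for equal addresses mirrors the convention of signaling that the destination has been reached.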

The above routing algorithm is used in the simulations of the single stage
cube interconnection network models. Besides the communications channel,
the new source and the new input queue are also needed. These
parameters are given by the following equations:


CHAN = SIZE \times SRCE - (SIZE - 3) + USERF(2)
SRCE = USERF(3)
QUE = SIZE \times SRCE + 2 + USERF(4)

where SIZE is the size of the crossbar switch, and
USERF(2), USERF(3), and USERF(4) are user-written Fortran functions.


USERF(2) returns a value between -1 and SIZE-2 based on the bitwise exclusive-or
of the source-destination addresses and the occurrence of a 1 in bit
i. When the source and destination addresses are the same, a -1 is returned by
USERF(2), indicating that the packet has reached its destination. If the source
address does not equal the destination address, a positive value corresponding to the
bit position of the first 1 encountered is returned by USERF(2).

USERF(3) determines the next node in the routing sequence. An approach
similar to the one used in USERF(2) is used in USERF(3). A bitwise exclusive-or
is performed on the source-destination address pair, with the bit position of the first
1 encountered recorded. The next node is then determined by one of the following
two equations:


NXTNODE = source + 2^{bp}    source < destination

NXTNODE = source - 2^{bp}    source > destination


where bp is the bit position of the first 1 encountered.


USERF(4), in part, calculates the input queue at the new source node, returning
a value of at most SIZE-2. These three USERF functions, used in conjunction with
the current entity's attributes, give mathematical relationships for the parameters
needed in simulating the single stage cube network. This allows for the simulation
of arbitrary sized networks with little change necessary to the SLAM source code
and no change to the Fortran functions or subroutines.


4.7 The Mesh Network Model

The third interconnection network modeled in this investigation is the Illiac IV
mesh-type network. The Illiac IV differs from a mesh network in that the processing
elements that lie on the edges of the network are connected to other edge processing
elements in a "wrap-around" manner described by the four Illiac IV interconnection
functions defined in Chapter 2. As with the multistage cube interconnection network
and the single stage cube interconnection network, the Illiac IV network uses crossbar
switches for the interconnection of processing elements. Where the Illiac IV network
differs from the other two networks is in the size of the switch. In the Illiac IV
network, the size of the crossbar switch is fixed at 5-by-5 for any size network. The
size is fixed due to the structure of the mesh-type network, where any processing
element is directly connected to four neighboring processing elements.

The SLAM modeling approach for the Illiac IV interconnection network is
identical to the approach used in modeling the single stage cube interconnection
network. All entity attributes and global XX variables are defined in the same
manner as in the single stage cube model. The two interconnection networks differ
in their respective interconnection function implementations.

4.7.1 Packet Routing in the Illiac IV Model. The SLAM implementation of
the packet routing in the Illiac IV network is performed, in part, by the use of three
user-written Fortran functions, USERF(5), USERF(6), and USERF(7). These
functions return values to the following three equations to determine the outgoing
channel, the next source node, and the input queue number at the next source node:


CHAN = SIZE \times SRCE - (SIZE - 3) + USERF(6)
SRCE = USERF(5)
QUE = SIZE \times SRCE + 2 + USERF(7)


The user-written function USERF(5) determines the next node in the routing
sequence. This is done by first performing the four Illiac IV interconnection
functions to determine if the current node is directly connected to the destination node.
If a direct connection exists, then the new source node assumes the value of the
destination node address. If the current node is not directly connected to the
destination node, then USERF(5) returns a value for the new source node which is equal
to one of the four neighboring nodes. The neighboring node chosen is determined
from a set of rules which finds the minimum difference between the destination node
and each of the four neighboring nodes of the current source node. The minimum
difference ensures that the routing path from any source to any destination node is
the shortest path. In the event that multiple shortest paths exist, the first shortest
path encountered is the one taken.

The outgoing channel equation is defined above and uses USERF(6) to calculate
the relative position of the outgoing channel at the crossbar switch. As with
USERF(5), USERF(6) first determines if a direct connection exists between source
and destination nodes. If the direct connection exists, USERF(6) returns -1 to
indicate that the packet has reached its destination. If the current node is not equal
to the destination node, USERF(6) returns a value between 0 and 3 inclusive which
indicates the relative position of the outgoing channel at the crossbar switch.


USERF(7) is used to assist in the calculation of the input queue number at
the next source node's crossbar switch. Its function is identical to USERF(6) except
the range of values returned is limited from 0 to 3 inclusive. USERF(7) gives the
relative position of the input queue at the next source's crossbar switch.

Summary

In this chapter, the methodology used in formulating the multistage cube, the
single stage cube, and the Illiac IV interconnection networks was discussed in detail.
For each of the networks examined, a unique SLAM model was designed and
implemented. The problems encountered in initial attempts to model these networks
were also discussed to lend insight to future such investigations. These models will
be used to determine the performance of one network versus the other when
considering the average time that a packet spends in the network. This performance
measure is discussed in detail in Chapter 5.




5.  Network  Simulation,  Validation,  and  Performance  Comparisons 
5.1  Introduction 

This chapter presents the simulation, validation, and comparisons of the three
interconnection networks modeled in Chapter 4. Section 5.2 presents a discussion of
the time and machines required to perform the network simulations along with the
techniques used to ensure valid statistical representations of the network models. The
network validation procedures are discussed in Section 5.3. The delay characteristics
of each network are presented in Section 5.4. In this section, individual network
delay characteristics are examined. The delay characteristics of the three networks
are compared and evaluated in Section 5.5. Network buffer requirements along with
the memory costs of each network are also presented in Section 5.5.

5.2 Network Model Simulations

This investigation consists of four phases: the network modeling, the network
simulations, the network validations, and the performance comparisons of the
networks modeled. Having discussed phase one in Chapter 4, phases two, three, and
four remain. This section presents the approach taken in simulating the three
interconnection network models.

In attempting any large scale simulation, one of the major concerns to be
considered is whether the facilities are capable of supporting the simulation. This
investigation is believed to be the first at AFIT to attempt such a large scale simulation
using the simulation language SLAM. In modeling the interconnection networks chosen,
a wide range of memory requirements are needed. Also, the processing speed of the
machine is critical in a time-constrained investigation such as this one. In addition,
the anticipated loading on the system due to other users is of critical interest. All
of these factors are issues that must be researched prior to attempting the simulations.


For the network models to be simulated in this investigation, only one
machine is available to best handle the issues raised above. The ICC, an ELXSI
super-minicomputer-class machine, provides the real and virtual memory requirement,
processing speed, and system load capable of supporting simulations where tens of
thousands of SLAM files are used. The ICC is a dual processor machine which has
processor speeds of 6 and 12 mips (million instructions per second). Virtual and
real  memory  storage  is  capable  of  handling  concurrent  processes  on  the  order  of  oil 
Megabytes. The processing speeds of the ICC are 7 and 15 times faster than the
VAX-class machines previously used for SLAM simulations. This results in simulation
turnaround times being reduced from cpu hours to cpu minutes.

To get a "feel" for the size of the simulations to be performed, it is necessary
to give a few of the parameters that are specific to a given simulation. To simulate
any job in SLAM, certain parameters must be specified prior to the simulation. For
example, associated with a SLAM job is the Fortran subroutine MAIN. In MAIN,
the network size parameters are defined and memory allocated according to these
definitions. These parameters range from the arrays which contain the attributes of
every entity that enters the network, to arrays which specify the maximum number
of files, activities, and resources allowed by the latest configuration of the SLAM
executable code. One of the most important sets of arrays is the NSET/QSET
arrays. The NSET/QSET arrays specify the array sizes that must be "set aside"
to store the attributes of the entities which enter the network. Refer to [Pri86] for
an in-depth explanation of NSET/QSET and its value derivations. For the largest
networks to be simulated, the NSET/QSET arrays are set to 8,000,000. This in turn
causes the network model's executable code to be approximately 35 Megabytes in
size. Also of interest is the execution time required to simulate various size networks.
For networks comprised of 64 PEs, approximate simulation times are on the order
of 15 cpu minutes. Large networks, simulating 1024 PEs, require cpu times on the
order of 40 cpu hours. The turnaround times of these network simulations are, to a
large part, dependent on the load placed on the system by other users. Large jobs
take, on average, 1 to 1½ calendar days per simulation run.

Defining the network parameters to be investigated is the first step
in simulating the interconnection network models. For this investigation,
two network parameters are focused upon: the average message delay and the buffer
memory cost requirements of each network. The average message delay is defined as
the time required by a message to traverse the network from input PE to output PE.
The buffer memory cost is the product of the total number of buffers in the network,
the maximum buffer length (the length that will not be exceeded 99% of the time
that the network is in operation), and the unit cost per
memory size (assumed to be constant at one unit for this investigation).
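The buffer memory cost definition can be made concrete with a small Python sketch. The names used here are hypothetical, the sample of observed queue lengths is invented for illustration, and the 99% length is estimated with a simple nearest-rank percentile, which is an assumption and not necessarily the estimator used with the SLAM output statistics.

```python
def max_buffer_length(observed_lengths, coverage_percent=99):
    """Smallest length that is not exceeded in at least `coverage_percent`
    percent of the observations (nearest-rank percentile)."""
    ordered = sorted(observed_lengths)
    # Integer ceiling of coverage_percent * n / 100, avoiding float rounding.
    rank = (coverage_percent * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def buffer_memory_cost(n_buffers, observed_lengths, unit_cost=1):
    """Product of buffer count, 99% buffer length, and unit memory cost."""
    return n_buffers * max_buffer_length(observed_lengths) * unit_cost


# 100 hypothetical queue-length observations: mostly empty, one outlier of 7.
lengths = [0] * 90 + [1] * 5 + [2] * 4 + [7]
cost = buffer_memory_cost(n_buffers=64, observed_lengths=lengths)
```

With this sample the single outlier of length 7 falls outside the 99% bound, so each buffer is sized at 2 and the cost for 64 buffers is 128 units.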

Upon  defining  the  network  parameters  of  interest,  message  delay  curves  for  the 
three  networks  must  be  generated.  The  generation  of  these  curves  proved  to  be  time 
consuming  and  tedious.  For  each  of  the  curves  to  be  generated,  multiple  simulation 
runs  are  associated  with  each  discrete  point  on  the  delay  versus  network  loading 
curve.  With  each  new  iteration  of  the  simulation,  new  seed  values  for  the  random 
number  generators  must  be  given.  These  random  number  generators  are  used  in 
the  generation  of  PE  source-destination  address  pairs.  Determining  the  number 
of  simulation  runs  associated  with  each  data  point  is  dependent  upon  the  desired 
accuracy  and  degree  of  confidence  in  the  mean  average  message  delay  value  obtained 
from  the  multiple  runs.  In  addition  to  determining  the  number  of  simulation  runs 
necessary  for  a  given  point,  the  number  of  points  that  are  required  to  accurately 
reflect  the  average  message  delay  curve  characteristics  must  be  determined.  These 
two decision factors must be carefully researched, through pilot simulation runs, to
ensure that an accurate representation of the networks’ delay characteristics is obtained.

Each delay curve that is obtained consists of five data points. This number
is chosen for the reasons that follow. First, as observed from the pilot simulation
runs, the average message delays for the networks remain approximately constant for
light to medium network loading (the loading factor is discussed below). This allows
for a minimal number of data points to be examined in these loading ranges. Thus,
the emphasis of data point distribution can be placed in the area of the “knee” of
the message delay curve. The “knee” is the area of the curve where queueing delays
become more prevalent, but it does not include the portion of the curve where the
network is in saturation. Pilot simulation runs are required for each data point. These
runs are necessary to ascertain the steady-state delay value for a given load.

The steady-state condition of a system is when the system's operating
characteristics (in this case, the message delay) do not change over time. To reach
steady-state in a system, an initial “warm-up” period is needed to allow the system
to reach the point of normal operation. SLAM has a built-in construct which aids
in determining the steady-state of a system. Using the MONTR, SUMRY statement,
a “snap-shot” of the system parameters of interest can be obtained for any
desired time interval. The steady-state condition is determined by analyzing the
intermediate summary reports produced by the MONTR, SUMRY statement. Once
the determination of the time required by the system to reach steady-state has been
made, a second SLAM construct can be used to clear the statistical values of the
network that have been kept prior to the steady-state condition. The MONTR, CLEAR
statement allows the designer to clear the SLAM statistical arrays at a specified time
into the simulation.
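
The snapshot-and-clear procedure can be sketched outside of SLAM as follows. This Python fragment is only an analogy under stated assumptions: the 2% tolerance, the three-snapshot window, and the sample delays are hypothetical, and the function does not reproduce any SLAM statement. It scans periodic delay snapshots and reports the time from which the delay stops changing, which would be the time at which statistics are cleared.

```python
def warmup_end(times, delays, tolerance=0.02, window=3):
    """Return the earliest snapshot time after which `window` consecutive
    snapshot-to-snapshot relative changes in delay stay below `tolerance`,
    or None if the snapshots never settle.

    Assumption: all delays are positive, so the relative change is defined.
    """
    stable = 0
    for i in range(1, len(delays)):
        change = abs(delays[i] - delays[i - 1]) / delays[i - 1]
        stable = stable + 1 if change < tolerance else 0
        if stable == window:
            return times[i - window]
    return None


# Hypothetical snapshots: simulation time and average message delay so far.
snapshot_times = [100, 200, 300, 400, 500, 600, 700, 800]
snapshot_delays = [2.0, 3.1, 3.9, 4.3, 4.32, 4.31, 4.33, 4.32]
clear_time = warmup_end(snapshot_times, snapshot_delays)
print(clear_time)
```

In this made-up series the delay climbs during warm-up and then flattens, so statistics gathered before the reported time would be discarded.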

Simulations  of  the  network  models  require  that  each  network  be  operating  in 
the  steady-state  region  prior  to  obtaining  the  delay  curve  characteristics.  Therefore, 
each  network  simulated  requires  the  iterative  process  of  performing  time  "snap-shots" 
of  the  system  and  clearing  the  statistical  arrays  once  the  steady-state  simulation  time 
has  been  determined.  This  requires  exorbitant  amounts  of  CPU  time  to  determine 
the  network  steady-state  point  for  each  of  the  three  networks  examined  and  the 
five  data  points  associated  with  each  average  message  delay  curve  for  a  particular 
network. 


Two approaches exist in determining the number of simulation trials necessary
to ensure validity of the results obtained. The first approach, a mathematical one,
requires that the desired confidence interval be specified, which will result in the
determination of the number of simulation runs necessary to obtain this interval. The
second approach is to choose the number of simulation runs to make and allow the
confidence interval to directly result from this choice. The latter approach is chosen
for this investigation. Three independent simulation runs are made for each data
point used to construct the average message delay curve for a particular network
model. Resulting from these three simulation runs is a set of three average message
delay values which are used in calculating the mean message delay for a
particular load that is placed on the network. The standard deviation and variance
are then calculated from the mean value. Using three simulation runs per data point,
the variance from the mean message delay proves to be reasonably small. For lightly
loaded networks (i.e., negligible network queueing), the variance from the mean is
less than 2%, while for heavily loaded networks, the maximum variance is 9% from the
mean. These values indicate, with a high degree of confidence, that the mean message
delay values obtained accurately represent the model simulated. These mean
message delay values are used for comparison against previously published works to
determine the validity of the network models.
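
To make the second approach concrete, the sketch below computes the mean, the sample standard deviation, and the confidence-interval half-width that follows from a fixed number of runs. It is a minimal illustration under assumptions: the 95% confidence level, the Student-t critical values, and the three sample delays are chosen for the example and are not reported in this investigation.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by the number of runs n
# (degrees of freedom n - 1). The 95% level is an assumption of this sketch.
T_CRIT_95 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776}


def summarize_runs(delays):
    """Return (mean, sample standard deviation, 95% half-width) for the
    average message delays obtained from independent simulation runs."""
    n = len(delays)
    mean = statistics.mean(delays)
    stdev = statistics.stdev(delays)  # sample (n - 1) standard deviation
    half_width = T_CRIT_95[n] * stdev / math.sqrt(n)
    return mean, stdev, half_width


# Hypothetical average delays (packet cycles) from three runs at one load.
mean, stdev, half_width = summarize_runs([9.8, 10.0, 10.2])
print(mean, stdev, half_width)
```

With only three runs the critical value is large, so the interval is wide relative to the spread of the runs; fixing the number of runs in advance trades interval width for simulation time.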

5.3  Network  Model  Validation 

As  with  any  modeling  effort,  a  major  concern  is  determining  the  validity  of 
the model. Two methods exist: compare simulation results with analytic models,
or  compare  simulation  results  against  previously  published  works.  In  validating  the 
multistage  cube  network  and  single  stage  cube  network  models,  the  latter  approach 
is  chosen. 

Ideally, when comparing one's models with previously published works, the
operating environments of the networks compared should be equivalent to one another.
This gives the basis for accurate comparisons without having to translate network
parameters which may differ. Unfortunately, a direct comparison does not exist for
the results obtained from the networks modeled for this investigation and the results
obtained in previously published works. The major difference in modeling techniques
lies in the manner in which messages are generated and presented to the network.

In the works of [DiJ81, DaS86, AbP86, HsY87], a discrete time message
generation rate is assumed. By this, the probability of a processor generating a message
in a given cycle directly correlates to the message generation rate. If a PE's input
buffer to the network is empty, then the PE generates a message with probability mg
[DiJ81]. For example, a generation rate of 100% corresponds to each PE generating
a new message packet every time cycle its input buffer is detected to be empty.
Controlling the probability of message generation thus controls the load which is placed
on the network.
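
The discrete time generation rule described above can be sketched as a single cycle of a simulation loop. This is an illustrative reading of the rule, not code from any of the cited works; the function name and the list-of-flags representation of the input buffers are assumptions.

```python
import random


def generate_cycle(input_buffer_empty, m_g, rng):
    """Return the indices of the PEs that generate a message this cycle.

    A PE may generate only when its network input buffer is empty, and it
    then does so with probability m_g.
    """
    return [pe for pe, empty in enumerate(input_buffer_empty)
            if empty and rng.random() < m_g]


rng = random.Random(1)  # fixed seed, chosen arbitrarily for repeatability
empty_flags = [True, True, False, True]  # PE 2 still holds a queued message

# With m_g = 1.0 every PE with an empty input buffer generates a message;
# PE 2 is blocked regardless of m_g.
print(generate_cycle(empty_flags, 1.0, rng))
```

Because a blocked PE generates nothing, the offered load in this scheme falls as the network congests, which is the coupling between network state and message generation that matters when comparing results across studies.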

An alternative approach to defining the message generation and network
saturation rates in terms of continuous time generation processes is implemented in this
investigation. Using an infinite buffer model, the restriction is removed of having a
processor’s network input buffer empty before the generation of a new message is
allowed. This aspect of the model enables the effects of network congestion to be
more accurately reflected in the operation of the overall system. Previous work is
based on the observation that a heavily loaded network does not empty all of its
input buffers immediately and, as a result, “cuts off” the arrival of new messages.
The time that the inputs are cut off is not included in the published packet delay
statistics. The approach taken in this investigation serves to decouple the state of
the network and message generation. Therefore, packet delay statistics for heavily
loaded networks reflect both normal network queueing delays and the delay incurred
waiting to enter the network (a measure of the time the PE is idled in previous
work). Note that, in either approach, the network congestion and the generation
process are effectively decoupled under light loading conditions when the network's
input buffers would rarely block the arrival of new messages. In the network models
created for this investigation, any message that is generated is queued in the input
buffer associated with the generating PE if blockage occurs or is allowed to proceed
to the next network buffer en route toward its specified destination PE.

To make an accurate comparison of the average message delay times produced
in this investigation to the delay times presented in previously published works, a
correlation of the message generation rates used is needed. This correlation proves
to be an approximation due to the subtleties of modeling differences in the networks
in [DiJ81, KrS83, AbP86, DaS86, HsY87], as discussed above. In translating the
message generation rates used in this investigation to closely match the rates of
the above named works, it is first necessary to compare the message delay curves
generated “in-house” to the delay curves presented in the previously published works.
By overlaying the curves of the respective networks, and observing the delay curve
trends, an approximate correlation of the message generation rates is established. As
a result, the approximate correlation is given as follows: for a discrete probability of
generating a message, mg, where mg = 0.8, the corresponding message generation rate
used in this investigation is approximately 2N/3 messages/cycle, with N being the
number of PEs in the network; on the other end of the loading scale, where queueing
is negligible, an mg = 0.2 is approximately equivalent to N/8 messages/cycle.

5.3.1  Multistage Cube Network Validation  Using the above correlation, a
comparison of the average message delay times for a 64-PE multistage cube network
built with 2-by-2 crossbar switching elements serves as the validation base for
a multistage cube network of arbitrary N and crossbar switching element size. The
work of [KrS83] is chosen for the validation comparisons. Comparing the average
message delay values produced by the simulator to those of [KrS83], the average
message delay times differ by at most 6% in the case of a heavily loaded network.
Further validity of the model is gained by maximum buffer size comparisons with
[DiJ81]. For network loads corresponding to normal processor operating ranges (i.e.,
an aggregate message generation rate which ensures that the ratio of message service
rate to message generation rate does not cause network saturation), the maximum
network buffer sizes never exceed 6 packets in length. This implies, and is shown in
[DiJ81], that infinite buffer length models can be used to model networks comprised
of finite length buffers of small size. Deeming the 64-PE multistage cube network
model valid, models for N > 64 and crossbar switching element sizes greater than
2-by-2 can also be deemed valid. This validation step is possible due to the manner in
which the network is modeled. Any extension in size of the network or its constituent
crossbar switching element sizes requires that only size parameters be changed and
passed to SLAM. One Fortran module is used to model the operation of a crossbar
switching element. A change to the size of a crossbar switching element results in
a change in the number of input buffers to the switch that must be examined. The
buffer scanning mechanism is logically identical for an arbitrary sized crossbar. Since
the size parameter variations only require extensions to modules validated for 2-by-2
switching element sizes, network models constructed using higher order switch sizes
may also be deemed valid.

5.3.2  The Single Stage Cube Network Validation  The validation of the single
stage cube network uses portions of the results obtained in validating the multistage
cube network and also uses the work of [HsY87] as its comparative base. In the
three interconnection networks modeled, each uses crossbar switching elements to
interconnect the system PEs. Based on the validation of the crossbar switching
element module, and the message generation correlations discussed above, direct
comparisons of the delay curves produced by the simulators are made with those
of [HsY87]. As with the multistage cube network, the validation base network size
is 64 PEs. Comparing the average message delay times for 0.2 < mg < 0.8, the
simulation results are approximately equivalent to those of [HsY87]. At mg > 0.8,
the differences between the delay times are approximately 10%. Comparisons of the
respective studies for N = 256 and N = 1024 show that the average message delay times
are approximately equivalent up to mg = 0.7. For mg values above 0.7, the differences
are once again approximately 10%. These delay value differences can be attributed
to the difference in message generation approaches of [HsY87] and the one used in
this investigation, as discussed previously.

5.3.3  The Illiac IV Network Validation  The validation of the Illiac IV network
extends from the validation of the two previously discussed models. As in the
multistage cube and single stage cube networks, the Illiac IV network uses crossbar
switching elements for inter-PE connections. The network models only differ in
the interconnection of the PEs and the routing algorithm used in message passing.
Verification of the correctness of the crossbar switch implementation was presented
above. The verification of the PE interconnection and shortest path routing results
from extensive testing of the interconnection functions and the routing algorithm.
Fortran programs serve as the base of the interconnection functions and routing
algorithm testing. Iterative testing, in Fortran, using different source-destination
address pairs, provides a valid initial approach in determining the accuracy of the
interconnection functions and the routing algorithm. A second testing method, the
SLAM TRACE construct, provides additional verification of the model.

5.4  Individual Network Performance

In modeling the interconnection networks, three network sizes are chosen for
implementation: N = 64, N = 256, and N = 1024. These network sizes are
representative of systems which are technologically implementable using current
microprocessor technology. In each of the multistage cube models, a choice exists for the
size of crossbar switching element to be used in the network implementation. In the
case of a 64-PE network, switch sizes of 2-by-2, 4-by-4, and 8-by-8 can be used in
modeling the network. Multistage cube networks of size N = 256 and N = 1024 can
also use multiple switch sizes in their respective implementations. For N = 256, the
possible sizes are 2-by-2, 4-by-4, and 16-by-16, while for N = 1024, 2-by-2, 4-by-4,
and 32-by-32 size switches can be used. Switch sizes greater than 32-by-32 are not
technologically feasible at the present (due to integrated-circuit pin-out limitations)
and are therefore not implemented for simulation. By varying the switch sizes, the
performance characteristics resulting from these variations can be observed. For N
= 256 and N = 1024, crossbar switch sizes of 2-by-2 are excluded from implementation
and evaluation due to the fact that the minimum message delays associated
with each of the implementations are greater than those associated with the mesh
and single stage cube networks of equal size.

Figure 5.1 shows the family of average message delay curves associated with
a 64-PE multistage cube network. For comparative purposes, the delay curve for
a 64-by-64 crossbar switch is included. Evident from this figure is the reduction in
average message delay times as the size of the crossbar switch is increased. This is
due in large part to the reduction in the number of stages that must be traversed by a
message as the crossbar switch size is increased. Also, as the crossbar switch size is
increased, the aggregate message arrival rate must be increased to drive the network
into saturation. This trend is in agreement with previous studies [Pat81, KrS83]
that have examined the effects of larger size crossbar switches. Similar effects can
be observed for larger size networks of size N = 256 and N = 1024. Figures 5.2 and
5.3 display the average message delay curves for these larger networks.

Besides the differences in interconnection functions and routing schemes, single
stage cube network implementations differ from multistage cube network
implementations in that the crossbar switching element size is fixed for a chosen network
size. Recall from Chapter 2 that for an m-dimension single stage cube network,
the crossbar switching elements that are used for inter-PE connections are of size
(m+1)-by-(m+1). As an example, consider an 8-dimension single stage cube which
consists of 2^8 = 256 PEs and 256 crossbar switches. The size of the crossbar switch
needed for construction of this network is 9-by-9.


Since only one implementation exists for a given network size, the single stage cube
network performance is discussed in the network comparisons which follow.

In the construction of an Illiac IV network, the crossbar switching element size
is fixed at 5-by-5 for any allowable network size. Network sizes may be the square of
any integer value greater than 1. For the purposes of comparison against the other
two networks modeled, the Illiac IV network is modeled for network sizes of 64,
256, and 1024. As in the case of the single stage cube model, only one implementation
of the Illiac IV network exists for a given network size. Therefore, the performance
discussion of the Illiac IV network is presented in the section which follows.

5.5  Network Performance Comparisons

In analyzing the results obtained from the three network simulations, certain
trends appear constant as the sizes of the networks are varied. These trends,
discussed below, allow for the discussion on the comparison and analysis of the three
networks to be focussed on one network size, where N = 256.

Before a comparison of the three network models' message delay characteristics
can be made, it is necessary to understand how the networks differ in the
minimal obtainable message delays. Using a fixed message cycle time (the packet
cycle time), which includes the processing time internal to the crossbar switch, the
minimal number of packet cycles that it takes a message to traverse the network is
quantifiable. In the multistage cube network, the number of hops from the source
PE to the destination PE is dependent upon the number of stages in the network,
and is explicitly determined by the size of the crossbar switching element used in the
network. Consider a network supporting 256 PEs. For the 4-by-4 switching element
implementation, 4 hops are required to traverse the network with a minimum delay
of 4 packet cycles (the delay will increase as a result of queueing actions within the
network caused by an increase in network load). For a 16-by-16 implementation,
2 hops and a delay of at least 2 cycles will be incurred by a message traversing
the network. The minimum number of hops and packet cycles messages incur in
network traversals becomes variable in the single stage cube and Illiac IV networks.
The minimum number of hops and packet cycles a message requires in the single
stage cube and the Illiac IV networks are 1 and 2 respectively. PEs that are directly
connected (via respective interconnection functions) are considered one hop distance
away from one another but require a minimum of two packet cycles to move through
the crossbar switching elements associated with each PE. The maximum number of
hops required to route a message in the single stage cube network is dependent upon
the dimension of the network. For N = 256, the dimension of the cube is 8, indicating
that the worst case distance that must be traveled by a message is 8 hops. The
minimum number of packet cycles required for the worst case distance is 9, which
is equal to the network dimension plus one. Similarly, in the Illiac IV network, the
worst case hop distance is √N and the minimum number of packet cycles required
for this worst case is √N + 1. In both cases, N is the number of PEs in the network.
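
The minimum-delay figures in the paragraph above can be tabulated with a short Python sketch. The function and network names are hypothetical; the formulas are taken directly from the text (stages for the multistage cube, dimension plus one for the single stage cube, √N + 1 for the Illiac IV), and queueing is ignored throughout.

```python
import math


def multistage_stages(n_pes, switch_size):
    """Stages in a multistage cube of n_pes PEs built from
    switch_size-by-switch_size crossbars (n_pes must be a power of it)."""
    stages, reach = 0, 1
    while reach < n_pes:
        reach *= switch_size
        stages += 1
    if reach != n_pes:
        raise ValueError("n_pes must be a power of switch_size")
    return stages


def min_packet_cycles(network, n_pes, switch_size=None):
    """Return (nearest-destination, worst-case-destination) minimum packet
    cycles for an unloaded network."""
    if network == "multistage":
        stages = multistage_stages(n_pes, switch_size)
        return stages, stages  # every message crosses every stage
    if network == "single_stage_cube":
        if n_pes & (n_pes - 1):
            raise ValueError("n_pes must be a power of two")
        dimension = n_pes.bit_length() - 1
        return 2, dimension + 1
    if network == "illiac":
        side = math.isqrt(n_pes)
        if side * side != n_pes:
            raise ValueError("n_pes must be a perfect square")
        return 2, side + 1
    raise ValueError("unknown network")


for name, size in [("multistage", 4), ("multistage", 16),
                   ("single_stage_cube", None), ("illiac", None)]:
    print(name, size, min_packet_cycles(name, 256, size))
```

For 256 PEs the multistage cube's delay floor is fixed by its stage count, whereas the other two networks have a floor that depends on how far apart the source and destination lie.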

Shown in Figures 5.4, 5.5, and 5.6 are the average message delay curves for
networks of size N = 64, N = 256, and N = 1024 at various loading levels. Figure 5.5
is used for the discussion of the trends alluded to above. Note that for relatively
light loading (i.e., the aggregate message arrival rate of 2 packets/cycle), the average
message delay for the single stage cube network and the Illiac IV network [...]
average of the sum of the minimum and maximum hop times, 5 and [...].
This is the lowest average delay time possible for messages assigned [...]
addresses from a uniform distribution.

Evident from Figure 5.5 is the preference of the [...]
two networks for light to medium loading. At an [...]
of N/2 packets/cycle, the single stage cube [...]
stage cube network constructed with [...]
characteristics associated with [...]
in the saturation [...]
the sizes evaluated. As the aggregate message arrival rate is increased, the networks
tend to saturate in the following order: the Illiac IV network, the multistage cube
network, and then the single stage cube network. Saturation occurs first in the Illiac
IV network due primarily to the greater average hop distance that messages must
travel in traversing the network from source PE to destination PE. For a given message
arrival rate, the larger hop distances experienced in the Illiac IV cause more
messages to be in the network at any instant of time. This, in turn, causes increases
in message congestion, resulting in increases in network queueing and greater average
message delay times. In the 64-PE and 256-PE networks, the average hop distances
that messages travel in the Illiac IV network are approximately double the hop
distances of the single stage cube network and, in certain cases, four times the hop
distances required in the multistage cube network. This disparity in average hop
distances becomes more profound as the network sizes increase. As a result, the
nonsaturation operating range of the Illiac IV network is severely limited in comparison
to the single stage cube and the multistage cube networks.

The saturation characteristics evident in the single stage cube networks result
directly from the average message hop distances and the number of packet buffers
required for the network implementation. While the average message hop distances
closely compare with those of the multistage cube networks, the number of packet
buffers required to implement a single stage cube can be as much as five times (for
N = 1024) the number required to implement a multistage cube network of similar
size. In the case of a 256-PE single stage cube network, the number of packet buffers
required is a two fold increase over the 256-PE multistage cube network constructed
with 4-by-4 crossbar switching elements, and a four fold increase over a 256-PE
multistage cube network built with 16-by-16 crossbar switching elements. This causes
the single stage cube network to have less message congestion for uniform network
loading than the multistage cube network of similar size, for a given aggregate
message arrival rate. Less message congestion results in fewer network queueing delays
and a greater nonsaturated network operating range.

Comparisons  of  the  single  stage  cube  delay  characteristics  with  those  of  the 
multistage  cube  network,  constructed  with  16-by-16  crossbar  switching  elements, 
show  that  the  multistage  cube  network  performs  better  than  the  single  stage  cube 
network  for  light  to  moderately  heavy  network  loading.  This  is  due  to  the  multistage 
cube’s reduced minimum hop distance (2), which requires a heavier network loading
to cause average message delays to approximate those of the single stage cube
network. Only when the aggregate message generation rate is greater than 2N/3 are
the delay characteristics approximately equivalent for the single stage cube and
the  multistage  cube  network  built  with  16-by-16  crossbar  switching  elements.  As 
the  message  load  increases,  the  message  queueing  becomes  more  predominant  in  the 
multistage  cube  network  and  as  a  result,  the  average  message  delay  times  become 
larger  than  those  in  the  single  stage  cube  network  for  a  given  load. 

From  the  discussion  above,  conclusions  can  be  made  concerning  the  implemen¬ 
tation  preference  of  one  interconnection  network  over  another  based  entirely  on  the 
average  message  delay  characteristics  associated  with  each  network.  The  physical 
structure  of  the  Illiac  IV  network  leads  to  inherently  large  average  message  delay 
times  and  restricted  operating  ranges  (network  loads).  If  these  are  the  dominant 
performance  factors  in  a  network's  design,  the  mesh  network  is  not  the  best  choice 
of network topologies for implementation given the three investigated topologies. A
choice between the multistage cube network and the single stage cube network lies
ultimately in the size of the network and the size of the crossbar switching element to
be used in the implementation of the multistage cube network. Recall from Chapter
3 that [AbP86] showed that for N = 4096, crossbar switching elements of 8-by-8 and
larger must be implemented in the multistage cube network for it to outperform (in
terms of average message delay) the single stage cube network. This investigation
provides similar results, where a reduction in the size of the network reduces the
crossbar switching element sizes of the multistage cube network necessary to outperform
the single stage cube network.

The second performance parameter to be discussed in this investigation is the
memory cost associated with implementing a chosen network. Tables 5.1, 5.2, and
5.3 show the hardware breakdown associated with each network and the maximum
packet buffer length requirements for a given aggregate message arrival rate. Recall
from Section 5.2 that the maximum packet buffer lengths shown in the tables listed
above are the lengths necessary to ensure that, 99% of the time that the network
is in operation, these lengths will not be exceeded. Referring to Tables 5.1, 5.2, and
5.3, certain trends can be observed. Note the maximum packet buffer lengths of
the single stage cube network. For the aggregate message arrival rates chosen, the
buffer sizes are equivalent across the network sizes evaluated for a given network
load. Comparing the maximum buffer sizes of the single stage cube network at
increased network loading to the sizes required in the multistage cube and Illiac IV
networks shows that the single stage cube network requires buffer lengths that are
approximately 1/2 and 1/5 of the requirements for the multistage cube and
Illiac IV networks, respectively.

This difference in buffer sizes is due to the large disparity between the number of
packet buffers that exist in each of the networks. As an example, consider the 256-PE
networks. The buffer requirements of the multistage cube network are 512 and 1024
buffers for implementations using 16-by-16 and 4-by-4 crossbar switching elements,
respectively. The Illiac IV requires 1280 buffers, while the single stage cube needs 2304
buffers for its construction. Having roughly two-fold and four-fold increases in packet
buffers over the multistage cube networks of 4-by-4 crossbars and 16-by-16 crossbars,
respectively, allows the messages entering the single stage cube network to be more
widely distributed, causing less message congestion for uniform network loading. The
reduction in packet buffer length requirements in the single stage cube network is
not enough, however, to make the single stage cube the most economical network to
implement.


Table 5.3. Buffer requirements for 1024-PE networks to avoid overflow 99% of the time.

Tables 5.4, 5.5, and 5.6 show the unit cost versus network load comparison for each
of the network sizes investigated. As defined in Section 5.2, the cost of a network is
the product of the number of buffers in the network, the maximum buffer length
requirement, and the unit dollar cost per specific buffer memory size. This cost
estimate also assumes that the dollar cost differences of implementing one size
crossbar switching element over another are negligible. As a clarifying example,
consider the networks of size N = 256. At light network loading, each network requires
that the maximum buffer length be one packet. The network with the least cost
requirement (512 cost units) is the multistage cube network constructed from
16-by-16 crossbar switching elements. At heavy network loading, the 16-by-16 crossbar
implementation of the multistage cube still provides the lowest cost requirement
(5120 cost units), but now the cost of implementing a single stage cube network
(9216 cost units) is equivalent to the cost of a multistage cube network constructed
using 4-by-4 crossbar switching elements. Only at loads greater than 2N/3
packets/cycle is the single stage cube network's total memory cost approximately the
same as the multistage cube network's cost. The total memory costs for the Illiac IV
network are only comparable for light network loading due to the increased queueing
delays the network experiences.
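The cost definition above can be checked with a small sketch. The buffer counts and heavy-load costs are the 256-PE entries of Table 5.5 (the light-load column equals the buffer count because the maximum buffer length is one packet). The unit cost of 1 and the "implied" maximum buffer lengths are assumptions obtained by dividing cost by buffer count; they are not figures quoted in the text.

```python
def network_cost(num_buffers, max_buffer_length, unit_cost=1):
    """Cost = number of buffers x maximum buffer length x unit cost."""
    return num_buffers * max_buffer_length * unit_cost


# Number of packet buffers in each 256-PE network (light-load column of
# Table 5.5, where every buffer is one packet long).
BUFFERS_256 = {
    "multistage cube, 16-by-16": 512,
    "multistage cube, 4-by-4": 1024,
    "single stage cube": 2304,
    "Illiac IV": 1280,
}

# Heavy-load (2N/3 packets/cycle) costs from Table 5.5.
HEAVY_COST_256 = {
    "multistage cube, 16-by-16": 5120,
    "multistage cube, 4-by-4": 9216,
    "single stage cube": 9216,
    "Illiac IV": 28160,
}

# Assumption: with a unit cost of 1, the maximum buffer length implied by
# each table entry is cost / number of buffers.
implied_lengths = {
    name: HEAVY_COST_256[name] // buffers
    for name, buffers in BUFFERS_256.items()
}

light_costs = {name: network_cost(b, 1) for name, b in BUFFERS_256.items()}
cheapest_light = min(light_costs, key=light_costs.get)
cheapest_heavy = min(HEAVY_COST_256, key=HEAVY_COST_256.get)
```

Under these assumptions the single stage cube needs far shorter buffers than the other networks at heavy load, yet its larger buffer count keeps its total cost at or above that of the multistage cube.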


Network            Crossbar       Aggregate message arrival rate (pkts/cycle)
                   switch size          2      N/8      N/3      N/2     2N/3

                                  Network cost (cost units)
multistage cube    2-by-2             384      768     1536     2304     3456
multistage cube    4-by-4             192      384      768     1152     2112
multistage cube    8-by-8             128      256      512      768     1280
crossbar           64-by-64            64      128      320      448      704
single stage cube  7-by-7             448      896     1344     1344     1792
Illiac IV          5-by-5             320      640      960     1280     2240

Table 5.4. 64-PE network cost per load comparison.
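The light-load entries of the cost tables appear to equal the number of packet buffers in each network. The sketch below reproduces those counts under assumptions inferred from the tabulated switch sizes rather than stated in this section: a multistage cube of k-by-k switches has log_k(N) stages of N buffers each, each single stage cube node uses a (log2(N) + 1)-port switch, and each Illiac IV node uses a 5-port switch, with one buffer per port. The function names are hypothetical.

```python
def stages(n, k):
    """Number of stages of k-by-k switches needed to connect n ports.

    n must be an exact power of k.
    """
    count, ports = 0, 1
    while ports < n:
        ports *= k
        count += 1
    if ports != n:
        raise ValueError(f"{n} is not a power of {k}")
    return count


def multistage_cube_buffers(n, k):
    """Assumed: N buffers per stage, log_k(N) stages."""
    return n * stages(n, k)


def single_stage_cube_buffers(n):
    """Assumed: each node has log2(N) cube links plus one PE port."""
    return n * (stages(n, 2) + 1)


def illiac_iv_buffers(n):
    """Assumed: each node has four mesh links plus one PE port."""
    return n * 5


# Buffer counts for the 64-PE networks, in the row order of Table 5.4.
counts_64 = [
    multistage_cube_buffers(64, 2),
    multistage_cube_buffers(64, 4),
    multistage_cube_buffers(64, 8),
    multistage_cube_buffers(64, 64),  # a single 64-by-64 crossbar
    single_stage_cube_buffers(64),
    illiac_iv_buffers(64),
]
```

The same functions give 512, 1024, 2304, and 1280 buffers for the 256-PE networks, matching the counts quoted in the discussion of buffer sizes.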


Network            Crossbar       Aggregate message arrival rate (pkts/cycle)
                   switch size          2      N/8      N/3      N/2     2N/3

                                  Network cost (cost units)
multistage cube    4-by-4            1024     2048     3072     6144     9216
multistage cube    16-by-16           512     1024     2048     3072     5120
single stage cube  9-by-9            2304     4608     6912     6912     9216
Illiac IV          5-by-5            1280     2560     7680    16640    28160

Table 5.5. 256-PE network cost per load comparison.


Network            Crossbar       Aggregate message arrival rate (pkts/cycle)
                   switch size          2      N/8      N/3      N/2     2N/3

                                  Network cost (cost units)
multistage cube    4-by-4            5120     5120    15360    20480    30720
multistage cube    32-by-32          2048     4096     6144    10240    14336
single stage cube  11-by-11         11264    22528    33792    33792    45056
Illiac IV          5-by-5            5120    10240    51200    87040   117760

Table 5.6. 1024-PE network cost per load comparison.


5.6  Summary 

In this chapter, the simulation, validation, and performance comparisons of the
multistage cube, the single stage cube, and the Illiac IV interconnection networks
have been presented. An in-depth discussion of the simulation tools and techniques
provided information about the machine requirements necessary for large scale
simulations and the methods used to ensure the correctness of the simulations.
Section 5.3 presented an overview of validation techniques, including the validation
of the three interconnection network models developed for this investigation.
Individual network performance characteristics were presented in Section 5.4, followed
by a performance comparison of the network models. From this comparison, it was
observed that, due to its physical structure, the Illiac IV network has inherently
large message delay times and restricted nonsaturation operating ranges as compared
to the single stage cube and multistage cube networks. It was also observed that the
implementation choice of single stage cube network or multistage cube network is
dependent upon the network size to be implemented and the size of the crossbar
switching element used in constructing the multistage cube network.
6. Conclusions and Recommendations

6.1  Summary  of  Thesis  Investigation 

Before addressing the conclusions and recommendations resulting from this
investigation, a brief review of the material presented in this thesis effort is necessary.
In Chapter 1, a background review of restrictions and limitations of the traditional
von Neumann computer was presented. The research goals of this investigation were
defined to allow for an accurate scope of the problem at hand.

Chapters 2 and 3 provided the background information necessary to understand
the techniques used in parallel processing system evaluation. Chapter 2 defined
the methodologies used in classifying parallel processing systems and the problems
inherent with each methodology. Interconnection network topologies were introduced
in Chapter 2, followed by discussions of contemporary parallel processing systems
which used these interconnection network implementations. Chapter 3 examined the
state of parallel processing systems evaluations, including crossbar switch analysis,
network switching methodology analysis, and network performance comparisons.

The methodology used in conducting this investigation was the topic of Chapter 4.
In this chapter, the simulation tool used to model the three interconnection
networks was discussed. The formulation of network models, including interconnection
functions and routing algorithms, was presented to provide insight into the "inner
workings" of the simulator. Chapter 5 began with a discussion of the network
simulation methodology. This discussion included the facilities necessary to perform
the investigation and the manner in which the simulations were performed. Network
validation procedures were used to ascertain the validity of the network models as
compared to previously published works. The performance measures of average
message delay, maximum packet buffer lengths, and total network implementation costs
led to the conclusion that the decision to implement a single stage cube network
or a multistage cube network requires forethought into the size of the network
to be implemented and the size of the crossbar switching element to be allowed for
implementation of the multistage cube network. Only when these two factors have
been determined can a clear-cut choice be made. The results obtained in Chapter 5
further indicate that the Illiac IV network's physical structure makes it the least
desirable of the networks investigated when considering the average message delay
as the dominant performance factor.

6.2  Thesis  Effort  Conclusions 

Several conclusions can be drawn from this thesis investigation. First, this
investigation has provided a unified base for the comparison of three classes of
interconnection networks, which, to the present, has not been done. A second point is the
alternative approach to viewing the average message delay and network saturation
rates. This alternative approach has provided a more accurate method of
determining the delay characteristics of a particular network. Thirdly, this investigation
has shown that large-scale simulations of parallel processing systems are feasible
using a commercially packaged simulation language such as SLAM. And lastly, an
alternative method of SLAM file scanning has been developed to add flexibility to
the language.

6.3 Recommendations for Future Research

This investigation has provided a comparison of three classes of interconnection
networks under a common set of system operating assumptions, which had not
been previously performed. Due to the diversity, complexity, and time constraints of
this investigation, certain simulator enhancements could not be implemented. These
enhancements to the simulator form a base for future research in the area of
interconnection network performance comparisons. These proposed enhancements are as
follows:


1.  Implement  dynamic  routing  for  the  single  stage  cube  and  Illiac  IV  networks. 
By  doing  so,  the  inherent  message  delay  characteristics  of  the  Illiac  IV  network 
may  be  changed  and  make  this  network  more  desirable  for  implementation. 

2. Simulate the effects of nonuniform loading on the networks. This will allow for
the determination of areas of network congestion and provide insight into possible
methods for removing the congestion.

3.  Implement  a  model  which  allows  for  the  partitioning  of  the  multistage  cube 
network. 

In closing, this investigation has shown that commercial simulation packages
such as SLAM can be used for parallel processing applications. Using simulation
languages allows for compact coding and ease of readability, which result in a faster
design and implementation turnaround. This thesis further serves as a link which has
previously been missing in the performance comparisons of classes of interconnection
networks. As a result, two technical papers, [RaD88a] and [RaD88b], have been
submitted for publication.
Appendix A. Source Code Release Information


Further information concerning the SLAM and Fortran source code developed
for this investigation may be obtained by contacting Captain Nathaniel J. Davis IV or
Captain Wade H. Shaw in the Department of Electrical and Computer Engineering,
Air Force Institute of Technology, Wright-Patterson AFB, OH 45433.